Preface to the Dover Edition


This edition contains minor corrections to the original edition. In the 28 years 
that have elapsed between these two editions, there have been great changes in 
computing equipment and in the development of numerical methods. However, 
the analysis required to understand and to devise new methods has not changed, 
and, thus, this somewhat mature text is still relevant. To the list of important 
topics omitted in the original edition (namely, linear programming, rational 
approximation and Monte Carlo) we must now add fast transforms, finite 
elements, wavelets, complexity theory, multigrid methods, adaptive gridding, 
path following and parallel algorithms. Hopefully, some energetic young 
numerical analyst will incorporate all these missing topics into an updated 
version to aid the burgeoning field of scientific computing. 

We thank the many people who have pointed out errors and misprints in the 
original edition. In particular, Mr. Carsten Eisner suggested an elegant 
improvement in our demonstration of the Runge phenomenon, which we have 
adopted in Problem 8 on page 280. 

Eugene Isaacson and Herbert B. Keller

New York and Pasadena 
July 1993 






Preface to the First Edition 


Digital computers, though mass produced for no more than fifteen years, 
have become indispensable for much current scientific research. One basic 
reason for this is that by implementing numerical methods, computers 
form a universal tool for “solving” broad classes of problems. While 
numerical methods have always been useful it is clear that their role in 
scientific research is now of fundamental importance. No modern applied 
mathematician, physical scientist, or engineer can be properly trained 
without some understanding of numerical methods. 

We attempt, in this book, to supply some of the required knowledge. In 
presenting the material we stress techniques for the development of new 
methods. This requires knowing why a particular method is effective on 
some problems but not on others. Hence we are led to the analysis of 
numerical methods rather than merely their description and listing. 

Certainly the solving of scientific problems should not be and is not 
the sole motivation for studying numerical methods. Our opinion is that 
the analysis of numerical methods is a broad and challenging mathematical 
activity whose central theme is the effective constructibility of various 
kinds of approximations. 

Many numerical methods have been neglected in this book since we do 
not attempt to be exhaustive. Procedures treated are either quite good and 
efficient by present standards or else their study is considered instructive 
(while their use may not be advocated). Unfortunately the limitations of 
space and our own experience have resulted in the omission of many 
important topics that we would have liked to include (for example, linear 
programming, rational approximation, Monte Carlo methods). 

The present work, it turns out, could be considered a mathematics 
text in selected areas of analysis and matrix theory. Essentially no
mathematical preparation beyond advanced calculus and elementary linear
algebra (or matrix theory) is assumed. Relatively important material on 
norms in finite-dimensional spaces, not taught in most elementary courses, 
is included in Chapter 1. Some familiarity with the existence theory for 
differential equations would be useful, but is not necessary. A cursory 
knowledge of the classical partial differential equations of mathematical 
physics would help in Chapter 9. No significant use is made of the theory 
of functions of a complex variable and our book is elementary in that 
sense. Deeper studies of numerical methods would also rely heavily on 
functional analysis, which we avoid here. 

The listing of algorithms to concretely describe a method is avoided. 
Hence some practical experience in using numerical methods is assumed 
or should be obtained. Examples and problems are given which extend 
or amplify the analysis in many cases (starred problems are more difficult).
It is assumed that the instructor will supplement these with computational
problems, according to the availability of computing facilities. 

References have been kept minimal and are usually to one of the general 
texts we have found most useful and compiled into a brief bibliography. 
Lists of additional, more specialized references are given for the four 
different areas covered by Chapters 1-4, Chapters 5-7, Chapter 8, and 
Chapter 9. A few outstanding journal articles have been included here. 
Complete bibliographies can be found in several of the general texts. 

Key equations (and all theorems, problems, and figures) are numbered 
consecutively by integers within each section. Equations, etc., in other 
sections are referred to by a decimal notation with explicit mention of the 
chapter if it is not the current one [that is, equation (3.15) of Chapter 5].
Yielding to customary usage we have not sought historical accuracy in 
associating names with theorems, methods, etc. 

Several different one-semester and two-semester courses have been 
based on the material in this book. Not all of the subject matter can be 
covered in the usual one-year course. As examples of some plans that have 
worked well, we suggest: 

Two-semester courses: 

(A) Prerequisite—Advanced Calculus and Linear Algebra, Chapters 1-9; 

(B) Prerequisite—Advanced Calculus (with Linear Algebra required only 
for the second semester), Chapters 3, 5-7, 8 (through Section 3), 
1, 2, 4, 8, 9.

One-semester courses: 

(A) Chapters 3, 5-7, 8 (through Section 3); 

(B) Chapters 1-5;

(C) Chapters 8, 9 (plus some material from Chapter 2 on iterative
methods). 

This book benefits from our experience in trying to teach such courses 
at New York University for over fifteen years and from our students’ 
reactions. Many of our former and present colleagues at the Courant 
Institute of Mathematical Sciences are responsible for our education in 
this field. We acknowledge our indebtedness to them, and to the stimulating
environment of the Courant Institute. Help was given to us by our
friends who have read and used preliminary versions of the text. In this 
connection we are happy to thank Prof. T. E. Hull, who carefully read 
our entire manuscript and offered much constructive criticism; Dr. 
William Morton, who gave valuable suggestions for Chapters 5-7; Professor
Gene Golub, who helped us to improve Chapters 1, 2, and 4. We
are grateful for the advice given us by Professors H. O. Kreiss, Beresford 
Parlett, Alan Solomon, Peter Ungar, Richard Varga, and Bernard Levinger, 
and Dr. Olof Widlund. Thanks are also due to Mr. Julius Rosenthal and 
Dr. Eva Swenson who helped in the preparation of mimeographed lecture 
notes for some of our courses. This book grew from two sets of these 
notes upon the suggestion of Mr. Earle Brach. We are most grateful to 
Miss Connie Engle who carefully typed our manuscript and to Mr. 
Richard Swenson who helped in reading galleys. Finally, we must thank 
Miss Sallyanne Riggione, who as copy editor made many helpful suggestions
to improve the book.


New York and Pasadena 
April, 1966


E. Isaacson and H. B. Keller

Norms, Arithmetic, and
Well-Posed Computations 


0. INTRODUCTION 

In this chapter, we treat three topics that are generally useful for the 
analysis of the various numerical methods studied throughout the book. 
In Section 1, we give the elements of the theory of norms of finite-dimensional
vectors and matrices. This subject properly belongs to the field of
linear algebra. In later chapters, we may occasionally employ the notion
of the norm of a function. This is a straightforward extension of the 
notion of a vector norm to the infinite-dimensional case. On the other 
hand, we shall not introduce the corresponding natural generalization, 
i.e., the notion of the norm of a linear transformation that acts on a 
space of functions. Such ideas are dealt with in functional analysis, and
might profitably be used in a more sophisticated study of numerical 
methods. 

We study briefly, in Section 2, the practical problem of the effect of 
rounding errors on the basic operations of arithmetic. Except for calculations
involving only exact-integer arithmetic, rounding errors are invariably
present in any computation. A most important feature of the
later analysis of numerical methods is the incorporation of a treatment 
of the effects of such rounding errors. 

Finally, in Section 3, we describe the computational problems that are 
“reasonable” in some general sense. In effect, a numerical method which 
produces a solution insensitive to small changes in data or to rounding 
errors is said to yield a well-posed computation. How to determine the 
sensitivity of a numerical procedure is dealt with in special cases throughout
the book. We indicate heuristically that any convergent algorithm is a
well-posed computation.

1. NORMS OF VECTORS AND MATRICES 

We assume that the reader is familiar with the basic theory of linear 
algebra, not necessarily in its abstract setting, but at least with specific 
reference to finite-dimensional linear vector spaces over the field of complex
scalars. By “basic theory” we of course include: the theory of linear
systems of equations, some elementary theory of determinants, and the 
theory of matrices or linear transformations to about the Jordan normal 
form. We hardly employ the Jordan form in the present study. In fact 
a much weaker result can frequently be used in its place (when the divisor 
theory or invariant subspaces are not actually involved). This result is all 
too frequently skipped in basic linear algebra courses, so we present it as 

Theorem 1. For any square matrix $A$ of order $n$ there exists a
non-singular matrix $P$, of order $n$, such that

$$B = P^{-1}AP$$

is upper triangular and has the eigenvalues of $A$, say
$\lambda_j \equiv \lambda_j(A)$, $j = 1, 2, \ldots, n$, on the principal
diagonal (i.e., any square matrix is equivalent to a triangular matrix).

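Proof. We sketch the proof of this result. The reader should have no
difficulty in completing the proof in detail.

Let $\lambda_1$ be an eigenvalue of $A$ with corresponding eigenvector
$\mathbf{u}_1$.† Then pick a basis for the $n$-dimensional complex vector
space, $C_n$, with $\mathbf{u}_1$ as the first such vector. Let the
independent basis vectors be the columns of a non-singular matrix $P_1$,
which then determines the transformation to the new basis. In this new basis
the transformation determined by $A$ is given by $B_1 \equiv P_1^{-1}AP_1$
and since $A\mathbf{u}_1 = \lambda_1\mathbf{u}_1$,

$$B_1 = P_1^{-1}AP_1 =
\begin{pmatrix}
\lambda_1 & \alpha_1 & \alpha_2 & \cdots & \alpha_{n-1} \\
0 & & & & \\
\vdots & & & A_2 & \\
0 & & & &
\end{pmatrix},$$

where $A_2$ is some matrix of order $n - 1$.

The characteristic polynomial of $B_1$ is clearly

$$\det(\lambda I_n - B_1) = (\lambda - \lambda_1)\det(\lambda I_{n-1} - A_2),$$

where $I_n$ is the identity matrix of order $n$. Now pick some eigenvalue
$\lambda_2$ of $A_2$ and corresponding $(n - 1)$-dimensional eigenvector,
$\mathbf{v}_2$; i.e.,

$$A_2\mathbf{v}_2 = \lambda_2\mathbf{v}_2.$$

With this vector we define the independent $n$-dimensional vectors

$$\hat{\mathbf{u}}_1 \equiv (1, 0, \ldots, 0)^T, \qquad
\hat{\mathbf{u}}_2 \equiv (0, v_{12}, v_{22}, \ldots, v_{n-1,2})^T.$$

Note that with the scalar
$\alpha = \alpha_1 v_{12} + \alpha_2 v_{22} + \cdots + \alpha_{n-1} v_{n-1,2}$,

$$B_1\hat{\mathbf{u}}_1 = \lambda_1\hat{\mathbf{u}}_1, \qquad
B_1\hat{\mathbf{u}}_2 = \lambda_2\hat{\mathbf{u}}_2 + \alpha\hat{\mathbf{u}}_1,$$

and thus if we set $\mathbf{u}_1 = P_1\hat{\mathbf{u}}_1$,
$\mathbf{u}_2 = P_1\hat{\mathbf{u}}_2$, then

$$A\mathbf{u}_1 = \lambda_1\mathbf{u}_1, \qquad
A\mathbf{u}_2 = \lambda_2\mathbf{u}_2 + \alpha\mathbf{u}_1.$$

Now we introduce a new basis of $C_n$ with the first two vectors being
$\mathbf{u}_1$ and $\mathbf{u}_2$. The non-singular matrix $P_2$ which
determines this change of basis has $\mathbf{u}_1$ and $\mathbf{u}_2$ as its
first two columns; and the original linear transformation in the new basis
has the representation

$$B_2 = P_2^{-1}AP_2 =
\begin{pmatrix}
\lambda_1 & \alpha & \times & \cdots & \times \\
0 & \lambda_2 & \times & \cdots & \times \\
0 & 0 & & & \\
\vdots & \vdots & & A_3 & \\
0 & 0 & & &
\end{pmatrix},$$

where $A_3$ is some matrix of order $n - 2$.

The theorem clearly follows by the above procedure; a formal inductive
proof could be given. ■

† Unless otherwise indicated, boldface type denotes column vectors. For
example, an $n$-dimensional vector $\mathbf{u}_k$ has the components
$u_{ik}$; i.e., $\mathbf{u}_k \equiv (u_{1k}, u_{2k}, \ldots, u_{nk})^T$.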
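The first step of the construction can be checked by hand on a small case. The following is a minimal sketch, assuming an illustrative 2 x 2 matrix and hypothetical helper names that do not come from the text: an eigenvector is placed in the first column of P, and the similarity transform is then upper triangular with the eigenvalues on the diagonal.

```python
# Minimal sketch of the first step in the proof of Theorem 1 for a 2x2 matrix.
# The matrix A and the helper name matmul are illustrative assumptions.

def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = [[2, 1],
     [1, 2]]          # eigenvalues 3 and 1; (1, 1) is an eigenvector for 3

# P has the eigenvector u1 = (1, 1) as its first column and any independent
# vector, here (0, 1), as its second column.
P = [[1, 0],
     [1, 1]]
P_inv = [[1, 0],
         [-1, 1]]     # inverse of P, written out by hand

B = matmul(P_inv, matmul(A, P))
print(B)              # upper triangular, eigenvalues of A on the diagonal
```

For matrices of higher order the same step would be repeated on the trailing block, as in the sketch of the proof.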

It is easy to prove the related stronger result of Schur stated in Theorem 
2.4 of Chapter 4 (see Problem 2.13(b) of Chapter 4). We turn now to the 
basic content of this section, which is concerned with the generalization 
of the concept of distance in $n$-dimensional linear vector spaces.

The “distance” between a vector and the null vector, i.e., the origin, 
is a measure of the “size” or “length” of the vector. This generalized 
notion of distance or size is called a norm. In particular, all such
generalizations are required to have the following properties:

(0) To each vector $\mathbf{x}$ in the linear space, $\mathcal{V}$, say, a
unique real number is assigned; this number, denoted by $\|\mathbf{x}\|$ or
$N(\mathbf{x})$, is called the norm of $\mathbf{x}$ iff:

(i) $\|\mathbf{x}\| \ge 0$ for all $\mathbf{x} \in \mathcal{V}$ and
$\|\mathbf{x}\| = 0$ iff $\mathbf{x} = \mathbf{o}$;

where $\mathbf{o}$ denotes the zero vector (if $\mathcal{V} = C_n$, then
$o_i \equiv 0$);

(ii) $\|\alpha\mathbf{x}\| = |\alpha| \cdot \|\mathbf{x}\|$ for all scalars
$\alpha$ and all $\mathbf{x} \in \mathcal{V}$;

(iii) $\|\mathbf{x} + \mathbf{y}\| \le \|\mathbf{x}\| + \|\mathbf{y}\|$, the
triangle inequality,† for all $\mathbf{x}, \mathbf{y} \in \mathcal{V}$.

Some examples of norms in the complex $n$-dimensional space $C_n$ are

$$\|\mathbf{x}\|_1 \equiv N_1(\mathbf{x}) \equiv \sum_{j=1}^{n} |x_j|, \tag{1a}$$

$$\|\mathbf{x}\|_2 \equiv N_2(\mathbf{x}) \equiv
\Bigl(\sum_{j=1}^{n} |x_j|^2\Bigr)^{1/2}, \tag{1b}$$

$$\|\mathbf{x}\|_p \equiv N_p(\mathbf{x}) \equiv
\Bigl(\sum_{j=1}^{n} |x_j|^p\Bigr)^{1/p}, \qquad p \ge 1, \tag{1c}$$

$$\|\mathbf{x}\|_\infty \equiv N_\infty(\mathbf{x}) \equiv \max_{j} |x_j|. \tag{1d}$$

It is an easy exercise for the reader to justify the use of the notation in
(1d) by verifying that

$$\lim_{p \to \infty} N_p(\mathbf{x}) = N_\infty(\mathbf{x}).$$

The norm, $\|\cdot\|_2$, is frequently called the Euclidean norm as it is just
the formula for distance in ordinary three-dimensional Euclidean space
extended to dimension $n$. The norm, $\|\cdot\|_\infty$, is called the maximum
norm or occasionally the uniform norm. In general, $\|\cdot\|_p$, for
$p \ge 1$ is termed the $p$-norm.
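The limit relation between the $p$-norms and the maximum norm can be observed numerically. The following is a minimal sketch, assuming hypothetical function names `norm_p` and `norm_inf` and an arbitrary sample vector; the scaling by the largest component is an implementation detail added here to avoid overflow for large $p$.

```python
def norm_p(x, p):
    """N_p(x) = (sum |x_j|^p)^(1/p) for p >= 1, scaled to avoid overflow."""
    big = max(abs(c) for c in x)
    if big == 0:
        return 0.0
    return big * sum((abs(c) / big) ** p for c in x) ** (1.0 / p)

def norm_inf(x):
    """N_inf(x) = max |x_j|."""
    return max(abs(c) for c in x)

x = [3, -4, 1 + 1j]          # sample vector in C_3 (an assumption)
for p in (1, 2, 8, 64, 512):
    print(p, norm_p(x, p))   # values decrease toward the maximum norm
print("max norm:", norm_inf(x))
```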

To verify that (1) actually defines norms, we observe that conditions 
(0), (i), and (ii) are trivially satisfied. Only the triangle inequality, (iii), 
offers any difficulty. However,

$$N_1(\mathbf{x} + \mathbf{y}) = \sum_{j=1}^{n} |x_j + y_j|
\le \sum_{j=1}^{n} \bigl(|x_j| + |y_j|\bigr)
= \sum_{j=1}^{n} |x_j| + \sum_{j=1}^{n} |y_j|
= N_1(\mathbf{x}) + N_1(\mathbf{y});$$

$$N_\infty(\mathbf{x} + \mathbf{y}) = \max_{j} |x_j + y_j|
\le \max_{j} \bigl(|x_j| + |y_j|\bigr)
\le \max_{j} |x_j| + \max_{k} |y_k|
= N_\infty(\mathbf{x}) + N_\infty(\mathbf{y}),$$

so (1a) and (1d) define norms.

† For complex numbers $x$ and $y$ the elementary inequality
$|x + y| \le |x| + |y|$ expresses the fact that the length of any side of a
triangle is not greater than the sum of the lengths of the other two sides.

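The proof of (iii) for (1b), the Euclidean norm, is based on the well-known
Cauchy-Schwarz inequality which states that

$$\Bigl|\sum_{i=1}^{n} \bar{x}_i y_i\Bigr|
\le \Bigl(\sum_{i=1}^{n} |x_i|^2\Bigr)^{1/2}
    \Bigl(\sum_{i=1}^{n} |y_i|^2\Bigr)^{1/2}
= N_2(\mathbf{x})N_2(\mathbf{y}). \tag{2}$$

To prove this basic result, let $|\mathbf{x}|$ and $|\mathbf{y}|$ be the
$n$-dimensional vectors with components $|x_j|$ and $|y_j|$,
$j = 1, 2, \ldots, n$, respectively. Then for any real scalar, $\xi$,

$$0 \le N_2^2\bigl(\xi|\mathbf{x}| + |\mathbf{y}|\bigr)
= \xi^2 \sum_{i=1}^{n} |x_i|^2 + 2\xi \sum_{i=1}^{n} |x_i||y_i|
+ \sum_{i=1}^{n} |y_i|^2.$$

But since the real quadratic polynomial in $\xi$ above does not change sign
its discriminant must be non-positive; i.e.,

$$\Bigl(\sum_{i=1}^{n} |x_i||y_i|\Bigr)^2
\le \Bigl(\sum_{i=1}^{n} |x_i|^2\Bigr)\Bigl(\sum_{i=1}^{n} |y_i|^2\Bigr).$$

However, we note that

$$\Bigl|\sum_{i=1}^{n} \bar{x}_i y_i\Bigr|^2
\le \Bigl(\sum_{i=1}^{n} |x_i||y_i|\Bigr)^2,$$

and (2) follows from the above pair of inequalities.

Now we form

$$N_2(\mathbf{x} + \mathbf{y})
= \Bigl(\sum_{i=1}^{n} |x_i + y_i|^2\Bigr)^{1/2}
= \Bigl(\sum_{i=1}^{n} (\bar{x}_i + \bar{y}_i)(x_i + y_i)\Bigr)^{1/2}$$

$$= \Bigl(\sum_{i=1}^{n} |x_i|^2
+ \sum_{i=1}^{n} (\bar{x}_i y_i + \bar{y}_i x_i)
+ \sum_{i=1}^{n} |y_i|^2\Bigr)^{1/2}$$

$$\le \Bigl(N_2^2(\mathbf{x}) + 2\sum_{i=1}^{n} |x_i| \cdot |y_i|
+ N_2^2(\mathbf{y})\Bigr)^{1/2}.$$

An application of the Cauchy-Schwarz inequality yields finally

$$N_2(\mathbf{x} + \mathbf{y}) \le N_2(\mathbf{x}) + N_2(\mathbf{y})$$

and so the Euclidean norm also satisfies the triangle inequality.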
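As a sanity check rather than a proof, the Cauchy-Schwarz inequality (2) and the resulting triangle inequality for $N_2$ can be tested on random complex vectors. This is a minimal sketch; the helper names, the dimension 5, the sample count, and the random seed are assumptions made only for illustration.

```python
import random

def norm_2(x):
    """Euclidean norm (1b)."""
    return sum(abs(c) ** 2 for c in x) ** 0.5

def inner(x, y):
    """Sum of conj(x_i) * y_i, the quantity bounded in (2)."""
    return sum(a.conjugate() * b for a, b in zip(x, y))

rng = random.Random(0)

def random_vector(n):
    return [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]

# Largest observed value of (left side - right side); never positive
# beyond rounding error if the inequalities hold.
worst_cs = worst_tri = float("-inf")
for _ in range(1000):
    x, y = random_vector(5), random_vector(5)
    worst_cs = max(worst_cs, abs(inner(x, y)) - norm_2(x) * norm_2(y))
    s = [a + b for a, b in zip(x, y)]
    worst_tri = max(worst_tri, norm_2(s) - norm_2(x) - norm_2(y))
print(worst_cs, worst_tri)
```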
The statement that

$$N_p(\mathbf{x} + \mathbf{y}) \le N_p(\mathbf{x}) + N_p(\mathbf{y}),
\qquad p \ge 1, \tag{3}$$

is known as Minkowski's inequality. We do not derive it here as general
$p$-norms will not be employed further. (A proof of (3) can be found in
most advanced calculus texts.)

We can show quite generally that all vector norms are continuous 
functions in C n . That is, 

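Lemma 1. Every vector norm, $N(\mathbf{x})$, is a continuous function of
$x_1, x_2, \ldots, x_n$, the components of $\mathbf{x}$.

Proof. For any vectors $\mathbf{x}$ and $\boldsymbol{\delta}$ we have by (iii)

$$N(\mathbf{x} + \boldsymbol{\delta}) \le N(\mathbf{x}) + N(\boldsymbol{\delta}),$$

so that

$$N(\mathbf{x} + \boldsymbol{\delta}) - N(\mathbf{x}) \le N(\boldsymbol{\delta}).$$

On the other hand, by (ii) and (iii),

$$N(\mathbf{x}) = N(\mathbf{x} + \boldsymbol{\delta} - \boldsymbol{\delta})
\le N(\mathbf{x} + \boldsymbol{\delta}) + N(\boldsymbol{\delta}),$$

so that

$$-N(\boldsymbol{\delta}) \le N(\mathbf{x} + \boldsymbol{\delta}) - N(\mathbf{x}).$$

Thus, in general

$$|N(\mathbf{x} + \boldsymbol{\delta}) - N(\mathbf{x})| \le N(\boldsymbol{\delta}).$$

With the unit vectors† $\{\mathbf{e}_k\}$, any $\boldsymbol{\delta}$ has the
representation

$$\boldsymbol{\delta} = \sum_{k=1}^{n} \delta_k \mathbf{e}_k.$$

Using (ii) and (iii) repeatedly implies

$$N(\boldsymbol{\delta}) \le \sum_{k=1}^{n} N(\delta_k \mathbf{e}_k)
= \sum_{k=1}^{n} |\delta_k| N(\mathbf{e}_k)
\le \max_{k} |\delta_k| \sum_{j=1}^{n} N(\mathbf{e}_j)
= M N_\infty(\boldsymbol{\delta}), \tag{4a}$$

where

$$M \equiv \sum_{j=1}^{n} N(\mathbf{e}_j). \tag{4b}$$

Using this result in the previous inequality yields, for any $\epsilon > 0$
and all $\boldsymbol{\delta}$ with $N_\infty(\boldsymbol{\delta}) \le \epsilon/M$,

$$|N(\mathbf{x} + \boldsymbol{\delta}) - N(\mathbf{x})| \le \epsilon.$$

This is essentially the definition of continuity for a function of the $n$
variables $x_1, x_2, \ldots, x_n$. ■

† $\mathbf{e}_k$ has the components $e_{ik}$, where $e_{ik} = 0$, $i \ne k$;
$e_{kk} = 1$.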
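The bound obtained in the proof, that the change in $N$ is at most $M$ times the maximum norm of the perturbation, can be illustrated for the particular choice $N = N_1$, for which $M = n$. This is a minimal sketch; the vector, the perturbations, and the helper names are assumptions made for illustration.

```python
def norm_1(x):
    return sum(abs(c) for c in x)

def norm_inf(x):
    return max(abs(c) for c in x)

def unit_vector(k, n):
    """k-th unit vector e_k of dimension n (0-based index k)."""
    return [1.0 if i == k else 0.0 for i in range(n)]

n = 4
# M is the sum of N(e_j); for N = N_1 each term equals 1.
M = sum(norm_1(unit_vector(k, n)) for k in range(n))

x = [1.0, -2.0, 0.5, 3.0]
results = []
for scale in (1e-1, 1e-3, 1e-6):
    delta = [scale, -scale / 2, scale / 3, -scale]
    shifted = [a + d for a, d in zip(x, delta)]
    change = abs(norm_1(shifted) - norm_1(x))
    bound = M * norm_inf(delta)
    results.append((scale, change, bound))
    print(scale, change, bound)   # change never exceeds bound
```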

See Problem 6 for a mild generalization. 

Now we can show that all vector norms are equivalent in the sense of 

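Theorem 2. For each pair of vector norms, say $N(\mathbf{x})$ and
$N'(\mathbf{x})$, there exist positive constants $m$ and $M$ such that for
all $\mathbf{x} \in C_n$:

$$m N'(\mathbf{x}) \le N(\mathbf{x}) \le M N'(\mathbf{x}).$$

Proof. The proof need only be given when one of the norms is $N_\infty$,
since $N$ and $N'$ are equivalent if they are each equivalent to $N_\infty$.
Let $S \subset C_n$ be defined by

$$S \equiv \{\mathbf{x} \mid N_\infty(\mathbf{x}) = 1, \ \mathbf{x} \in C_n\}$$

(this is frequently called the surface of the unit ball in $C_n$). $S$ is a
closed bounded set of points. Then since $N(\mathbf{x})$ is a continuous
function (see Lemma 1), we conclude by a theorem of Weierstrass that
$N(\mathbf{x})$ attains its minimum and its maximum on $S$ at some points of
$S$. That is, for some $\mathbf{x}^0 \in S$ and $\mathbf{x}^1 \in S$

$$N(\mathbf{x}^0) = \min_{\mathbf{x} \in S} N(\mathbf{x}), \qquad
N(\mathbf{x}^1) = \max_{\mathbf{x} \in S} N(\mathbf{x})$$

or

$$0 < N(\mathbf{x}^0) \le N(\mathbf{x}) \le N(\mathbf{x}^1) < \infty$$

for all $\mathbf{x} \in S$.

For any $\mathbf{y} \ne \mathbf{o}$ we see that
$\mathbf{y}/N_\infty(\mathbf{y})$ is in $S$ and so

$$N(\mathbf{x}^0) \le N\Bigl(\frac{\mathbf{y}}{N_\infty(\mathbf{y})}\Bigr)
\le N(\mathbf{x}^1)$$

or

$$N(\mathbf{x}^0)N_\infty(\mathbf{y}) \le N(\mathbf{y})
\le N(\mathbf{x}^1)N_\infty(\mathbf{y}).$$

The last two inequalities yield

$$m N_\infty(\mathbf{y}) \le N(\mathbf{y}) \le M N_\infty(\mathbf{y}),$$

where $m \equiv N(\mathbf{x}^0)$ and $M \equiv N(\mathbf{x}^1)$. ■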
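For the specific norms in (1) the constants of Theorem 2 relative to $N_\infty$ are known in closed form: the ratio $N_1/N_\infty$ lies between 1 and $n$, and $N_2/N_\infty$ lies between 1 and $\sqrt{n}$. The following minimal sketch samples random real vectors to illustrate these bounds; the dimension, sample count, seed, and helper names are assumptions.

```python
import random

def norm_1(x):
    return sum(abs(c) for c in x)

def norm_2(x):
    return sum(abs(c) ** 2 for c in x) ** 0.5

def norm_inf(x):
    return max(abs(c) for c in x)

rng = random.Random(1)
n = 3
ratios_1, ratios_2 = [], []
for _ in range(2000):
    y = [rng.uniform(-1, 1) for _ in range(n)]
    if norm_inf(y) == 0:
        continue                      # the ratio is undefined for y = o
    ratios_1.append(norm_1(y) / norm_inf(y))
    ratios_2.append(norm_2(y) / norm_inf(y))

print(min(ratios_1), max(ratios_1))   # stays within [1, n]
print(min(ratios_2), max(ratios_2))   # stays within [1, sqrt(n)]
```

Sampling only gives inner estimates of $m$ and $M$; the theorem guarantees that the extreme values are attained on the surface $S$.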
A matrix of order $n$ could be treated as a vector in a space of dimension
$n^2$ (with some fixed convention as to the manner of listing its elements).
Then matrix norms satisfying the conditions (0)-(iii) could be defined as in
(1). However, since the product of two matrices of order $n$ is also such a
matrix, we impose an additional condition on matrix norms, namely that

(iv) $\|AB\| \le \|A\| \cdot \|B\|$.

With this requirement the vector norms (1) do not all become matrix
norms (see Problem 2). However, there is a more natural, geometric,
way in which the norm of a matrix can be defined. Thus, if
$\mathbf{x} \in C_n$ and $\|\cdot\|$ is some vector norm on $C_n$, then
$\|\mathbf{x}\|$ is the “length” of $\mathbf{x}$, $\|A\mathbf{x}\|$ is the
“length” of $A\mathbf{x}$, and we define a norm of $A$, written as $\|A\|$
or $N(A)$, by the maximum relative “stretching,”

$$\|A\| \equiv \sup_{\mathbf{x} \ne \mathbf{o}}
\frac{\|A\mathbf{x}\|}{\|\mathbf{x}\|}. \tag{5}$$

Note that we use the same notation, $\|\cdot\|$, to denote vector and matrix
norms; the context will always clarify which is implied. We call (5) a
natural norm or the matrix norm induced by the vector norm, $\|\cdot\|$.
This is also known as the operator norm in functional analysis. Since for any
$\mathbf{x} \ne \mathbf{o}$ we can define
$\mathbf{u} = \mathbf{x}/\|\mathbf{x}\|$ so that $\|\mathbf{u}\| = 1$, the
definition (5) is equivalent to

$$\|A\| = \max_{\|\mathbf{u}\| = 1} \|A\mathbf{u}\| = \|A\mathbf{y}\|,
\qquad \|\mathbf{y}\| = 1. \tag{6}$$

That is, by Problems 6 and 7, $\|A\mathbf{u}\|$ is a continuous function of
$\mathbf{u}$ and hence the maximum is attained for some $\mathbf{y}$, with
$\|\mathbf{y}\| = 1$.

Before verifying the fact that (5) or (6) defines a matrix norm, we note
that they imply, for any vector $\mathbf{x}$, that

$$\|A\mathbf{x}\| \le \|A\| \cdot \|\mathbf{x}\|. \tag{7}$$

There are many other ways in which matrix norms may be defined.
But if (7) holds for some such norm then it is said to be compatible with
the vector norm employed in (7). The natural norm (5) is essentially the
“smallest” matrix norm compatible with a given vector norm.

To see that (5) yields a norm, we first note that conditions (i) and (ii)
are trivially verified. For checking the triangle inequality, let
$\mathbf{y}$ be such that $\|\mathbf{y}\| = 1$ and from (6),

$$\|A + B\| = \|(A + B)\mathbf{y}\|.$$

But then, upon recalling (7),

$$\|A + B\| \le \|A\mathbf{y}\| + \|B\mathbf{y}\| \le \|A\| + \|B\|.$$

Finally, to verify (iv), let $\mathbf{y}$ with $\|\mathbf{y}\| = 1$ now be
such that

$$\|AB\| = \|(AB)\mathbf{y}\|.$$

Again by (7), we have

$$\|AB\| \le \|A\| \cdot \|B\mathbf{y}\| \le \|A\| \cdot \|B\|,$$

so that (5) and (6) do define a matrix norm.

We shall now determine the natural matrix norms induced by some of
the vector $p$-norms ($p = 1, 2, \infty$) defined in (1). Let the $n$th order
matrix $A$ have elements $a_{jk}$, $j, k = 1, 2, \ldots, n$.

(A) The matrix norm induced by the maximum norm (1d) is

$$\|A\|_\infty = \max_{j} \sum_{k=1}^{n} |a_{jk}|, \tag{8}$$

i.e., the maximum absolute row sum. To prove (8), let $\mathbf{y}$ be such
that $\|\mathbf{y}\|_\infty = 1$ and

$$\|A\|_\infty = \|A\mathbf{y}\|_\infty.$$

Then,

$$\|A\|_\infty = \max_{j} \Bigl|\sum_{k=1}^{n} a_{jk} y_k\Bigr|
\le \max_{j} \Bigl(\sum_{k=1}^{n} |a_{jk}| \, |y_k|\Bigr)$$

$$\le \max_{k} |y_k| \cdot \max_{j} \sum_{k=1}^{n} |a_{jk}|
= \|\mathbf{y}\|_\infty \cdot \max_{j} \sum_{k=1}^{n} |a_{jk}|
= \max_{j} \sum_{k=1}^{n} |a_{jk}|,$$

so the right-hand side of (8) is an upper bound of $\|A\|_\infty$. Now if the
maximum row sum occurs for, say, $j = J$ then let $\mathbf{x}$ have the
components

$$x_k = \begin{cases}
\bar{a}_{Jk}/|a_{Jk}|, & a_{Jk} \ne 0, \\
0, & a_{Jk} = 0,
\end{cases} \qquad k = 1, 2, \ldots, n.$$

Clearly $\|\mathbf{x}\|_\infty = 1$, if $A$ is non-trivial, and

$$\|A\mathbf{x}\|_\infty = \sum_{k=1}^{n} |a_{Jk}|,$$

so (8) holds. [If $A = O$, property (ii) implies $\|A\| = 0$ for any natural
norm.] ■
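The construction in part (A) can be traced numerically: form the row sums, pick the row $J$ with the largest sum, build the vector $\mathbf{x}$ from the conjugate phases of that row, and check that $\|A\mathbf{x}\|_\infty$ equals the maximum row sum. This is a minimal sketch with an arbitrary complex test matrix and hypothetical helper names.

```python
def norm_inf_vec(x):
    return max(abs(c) for c in x)

def matvec(A, x):
    return [sum(a * c for a, c in zip(row, x)) for row in A]

def norm_inf_mat(A):
    """Maximum absolute row sum, the right-hand side of (8)."""
    return max(sum(abs(a) for a in row) for row in A)

A = [[1 + 2j, -3, 0],
     [2, 1j, -1],
     [0.5, 0.5, 0.5]]      # illustrative matrix (an assumption)

row_sums = [sum(abs(a) for a in row) for row in A]
J = row_sums.index(max(row_sums))

# x_k = conj(a_Jk) / |a_Jk| where a_Jk != 0, and 0 otherwise.
x = [a.conjugate() / abs(a) if a != 0 else 0 for a in A[J]]

print(norm_inf_vec(x))             # unit length in the maximum norm
print(norm_inf_vec(matvec(A, x)))  # attains the maximum row sum
print(norm_inf_mat(A))
```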

(B) Next, we claim that

$$\|A\|_1 = \max_{k} \sum_{j=1}^{n} |a_{jk}|, \tag{9}$$

i.e., the maximum absolute column sum. Now let $\|\mathbf{y}\|_1 = 1$ and be
such that

$$\|A\|_1 = \|A\mathbf{y}\|_1.$$

Then,

$$\|A\|_1 = \sum_{j=1}^{n} \Bigl|\sum_{k=1}^{n} a_{jk} y_k\Bigr|
\le \sum_{j=1}^{n} \sum_{k=1}^{n} |a_{jk}| \cdot |y_k|$$

$$= \sum_{k=1}^{n} \Bigl(|y_k| \sum_{j=1}^{n} |a_{jk}|\Bigr)
\le \sum_{k=1}^{n} |y_k| \Bigl(\max_{m} \sum_{j=1}^{n} |a_{jm}|\Bigr)$$

$$= \|\mathbf{y}\|_1 \max_{m} \sum_{j=1}^{n} |a_{jm}|
= \max_{m} \sum_{j=1}^{n} |a_{jm}|,$$

and the right-hand side of (9) is an upper bound of $\|A\|_1$. If the maximum
is attained for $m = K$, then this bound is actually attained for
$\mathbf{x} = \mathbf{e}_K$, the $K$th unit vector, since
$\|\mathbf{e}_K\|_1 = 1$ and

$$\|A\mathbf{e}_K\|_1 = \sum_{j=1}^{n} \Bigl|\sum_{k=1}^{n} a_{jk} e_{kK}\Bigr|
= \sum_{j=1}^{n} |a_{jK}|.$$

Thus (9) is established.
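Part (B) can be illustrated in the same way: the maximum absolute column sum is attained by applying $A$ to the unit vector that selects the heaviest column. This is a minimal sketch; the test matrix and the helper names are assumptions.

```python
def norm_1_vec(x):
    return sum(abs(c) for c in x)

def matvec(A, x):
    return [sum(a * c for a, c in zip(row, x)) for row in A]

def norm_1_mat(A):
    """Maximum absolute column sum, the right-hand side of (9)."""
    n = len(A)
    return max(sum(abs(A[j][k]) for j in range(n)) for k in range(n))

A = [[1 + 2j, -3, 0],
     [2, 1j, -1],
     [0.5, 0.5, 0.5]]      # illustrative matrix (an assumption)

n = len(A)
col_sums = [sum(abs(A[j][k]) for j in range(n)) for k in range(n)]
K = col_sums.index(max(col_sums))
e_K = [1 if k == K else 0 for k in range(n)]   # K-th unit vector

print(norm_1_vec(matvec(A, e_K)))  # equals the largest column sum
print(norm_1_mat(A))
```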


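(C) Finally, we consider the Euclidean norm, for which case we recall
the notation for the Hermitian transpose or conjugate transpose of any
rectangular matrix $A \equiv (a_{ij})$,

$$A^* \equiv \bar{A}^T,$$

i.e., if $A^* \equiv (b_{ij})$, then $b_{ij} = \bar{a}_{ji}$. Further, the
spectral radius of any square matrix $A$ is defined by

$$\rho(A) \equiv \max_{s} |\lambda_s(A)|, \tag{10}$$

where $\lambda_s(A)$ denotes the $s$th eigenvalue of $A$. Now we can state that

$$\|A\|_2 = \rho^{1/2}(A^*A). \tag{11}$$

To prove (11), we again pick $\mathbf{y}$ such that $\|\mathbf{y}\|_2 = 1$ and

$$\|A\|_2 = \|A\mathbf{y}\|_2.$$

From (1b) it is clear that $\|\mathbf{x}\|_2^2 = \mathbf{x}^*\mathbf{x}$,
since $\mathbf{x}^* \equiv (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n)$.
Therefore, from the identity $(A\mathbf{y})^* = \mathbf{y}^*A^*$, we find

$$\|A\|_2^2 = \|A\mathbf{y}\|_2^2 = (A\mathbf{y})^*(A\mathbf{y})
= \mathbf{y}^*A^*A\mathbf{y}. \tag{12}$$

But since $A^*A$ is Hermitian it has a complete set of $n$ orthonormal
eigenvectors, say $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n$, such that

$$\mathbf{u}_j^*\mathbf{u}_k = \delta_{jk}, \tag{13a}$$

$$A^*A\mathbf{u}_s = \lambda_s\mathbf{u}_s. \tag{13b}$$

The multiplication of (13b) by $\mathbf{u}_s^*$ on the left yields further

$$\lambda_s = \mathbf{u}_s^*A^*A\mathbf{u}_s \ge 0.$$

Every vector has a unique expansion in the basis $\{\mathbf{u}_s\}$. Say in
particular that

$$\mathbf{y} = \sum_{s=1}^{n} \alpha_s\mathbf{u}_s,$$

and then (12) becomes, upon recalling (13),

$$\|A\|_2^2 = \sum_{t=1}^{n} \bar{\alpha}_t\mathbf{u}_t^* A^*A
\sum_{s=1}^{n} \alpha_s\mathbf{u}_s
= \sum_{t=1}^{n} \bar{\alpha}_t\mathbf{u}_t^*
\sum_{s=1}^{n} \alpha_s\lambda_s\mathbf{u}_s
= \sum_{s=1}^{n} \lambda_s|\alpha_s|^2$$

$$\le \max_{s} \lambda_s \sum_{t=1}^{n} |\alpha_t|^2
= \max_{s} \lambda_s \equiv \rho(A^*A).$$

Thus $\rho^{1/2}(A^*A)$ is an upper bound of $\|A\|_2$. However, using
$\mathbf{y} = \mathbf{u}_s$, where $\lambda_s = \rho(A^*A)$, we get

$$\|A\mathbf{u}_s\|_2 = (\mathbf{u}_s^*A^*A\mathbf{u}_s)^{1/2}
= \rho^{1/2}(A^*A),$$

and so (11) follows. ■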
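Formula (11) can be evaluated in closed form for a real 2 x 2 matrix, since the eigenvalues of the symmetric matrix $A^TA$ follow from its trace and determinant. The following is a minimal sketch under that restriction; the function name, the test matrix, and the sampling grid are assumptions. The sampled maximum of $\|A\mathbf{y}\|_2$ over unit vectors should not exceed the computed norm.

```python
import math

def spectral_norm_2x2(A):
    """||A||_2 = sqrt(rho(A^T A)) for a real 2x2 matrix A."""
    (a, b), (c, d) = A
    # Entries of the symmetric matrix G = A^T A.
    g11 = a * a + c * c
    g12 = a * b + c * d
    g22 = b * b + d * d
    # Largest eigenvalue of G from its characteristic polynomial.
    half_trace = (g11 + g22) / 2
    disc = math.sqrt(((g11 - g22) / 2) ** 2 + g12 ** 2)
    return math.sqrt(half_trace + disc)

A = [[3, 0],
     [4, 5]]               # illustrative matrix (an assumption)
norm = spectral_norm_2x2(A)

# Sample ||A y||_2 over unit vectors y = (cos t, sin t).
grid = [2 * math.pi * k / 3600 for k in range(3600)]
sampled = max(
    math.hypot(A[0][0] * math.cos(t) + A[0][1] * math.sin(t),
               A[1][0] * math.cos(t) + A[1][1] * math.sin(t))
    for t in grid)
print(norm, sampled)
```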

We have observed that a matrix of order n can be considered as a vector 
of dimension n 2 . But since every matrix norm satisfies the conditions 
(O)-(iii) of a vector norm the results of Lemma 1 and Theorem 2 also 
apply to matrix norms. Thus we have 

Lemma 1'. Every matrix norm, $\|A\|$, is a continuous function of the $n^2$
elements $a_{ij}$ of $A$. ■

Theorem 2'. For each pair of matrix norms, say $\|A\|$ and $\|A\|'$, there
exist positive constants $m$ and $M$ such that for all $n$th order matrices $A$

$$m\|A\|' \le \|A\| \le M\|A\|'. \quad ■$$

The proofs of these results follow exactly the corresponding proofs for
vector norms so we leave their detailed exposition to the reader.
There is frequently confusion between the spectral radius (10) of a
matrix and the Euclidean norm (11) of a matrix. (To add to this confusion,
$\|A\|_2$ is sometimes called the spectral norm of $A$.) It should be observed
that if $A$ is Hermitian, i.e., $A^* = A$, then
$\lambda_s(A^*A) = \lambda_s^2(A)$ and so the spectral radius is equal to the
Euclidean norm for Hermitian matrices. However, in general this is not true,
but we have

Lemma 2. For any natural norm, $\|\cdot\|$, and square matrix, $A$,

$$\rho(A) \le \|A\|.$$

Proof. For each eigenvalue $\lambda_s(A)$ there is a corresponding
eigenvector, say $\mathbf{u}_s$, which can be chosen to be normalized for any
particular vector norm, $\|\mathbf{u}_s\| = 1$. But then for the
corresponding natural matrix norm

$$\|A\| = \max_{\|\mathbf{y}\| = 1} \|A\mathbf{y}\|
\ge \|A\mathbf{u}_s\| = \|\lambda_s\mathbf{u}_s\| = |\lambda_s|.$$

As this holds for all $s = 1, 2, \ldots, n$, the result follows. ■

On the other hand, for each matrix some natural norm is arbitrarily 
close to the spectral radius. More precisely we have 

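Theorem 3. For each $n$th order matrix $A$ and each arbitrary $\epsilon > 0$
a natural norm, $\|A\|$, can be found such that

$$\rho(A) \le \|A\| \le \rho(A) + \epsilon.$$

Proof. The left-hand inequality has been verified above. We shall show
how to construct a norm satisfying the right-hand inequality. By Theorem
1 we can find a non-singular matrix $P$ such that

$$PAP^{-1} \equiv B \equiv \Lambda + U$$

where $\Lambda \equiv (\lambda_j(A)\delta_{ij})$ and $U \equiv (u_{ij})$ has
zeros on and below the diagonal. With $\delta > 0$, a “sufficiently small”
positive number, we form the diagonal matrix of order $n$

$$D \equiv \operatorname{diag}(1, \delta^{-1}, \delta^{-2}, \ldots, \delta^{1-n}).$$

Now consider

$$C = DBD^{-1} = \Lambda + E,$$

where $E \equiv (e_{ij}) = DUD^{-1}$ has elements

$$e_{ij} = \begin{cases}
0, & j \le i, \\
u_{ij}\delta^{j-i}, & j > i,
\end{cases} \qquad i = 1, 2, \ldots, n.$$

Note that the elements $e_{ij}$ can be made arbitrarily small in magnitude by
choosing $\delta$ appropriately. Also we have that

$$A = P^{-1}D^{-1}CDP.$$

Since $DP$ is non-singular, a vector norm can be defined by

$$\|\mathbf{x}\| \equiv N_2(DP\mathbf{x})
= (\mathbf{x}^*P^*D^*DP\mathbf{x})^{1/2}.$$

The proof of this fact is left to the reader in Problem 5. The natural
matrix norm induced by this vector norm is of course

$$\|A\| = \max_{\|\mathbf{y}\| = 1} \|A\mathbf{y}\|.$$

However, from the above form for $A$, we have, for any $\mathbf{y}$,

$$\|A\mathbf{y}\| = N_2(DPA\mathbf{y}) = N_2(CDP\mathbf{y}).$$

If we let $\mathbf{z} \equiv DP\mathbf{y}$, this becomes

$$\|A\mathbf{y}\| = N_2(C\mathbf{z}) = (\mathbf{z}^*C^*C\mathbf{z})^{1/2}.$$

Now observe that

$$C^*C = (\Lambda^* + E^*)(\Lambda + E) = \Lambda^*\Lambda + \mathcal{M}(\delta).$$

Here the term $\mathcal{M}(\delta)$ represents an $n$th order matrix each of
whose terms is $O(\delta)$.† Thus, we can conclude that

$$\mathbf{z}^*C^*C\mathbf{z} \le \max_{s} |\lambda_s^2(A)| \, \mathbf{z}^*\mathbf{z}
+ |\mathbf{z}^*\mathcal{M}(\delta)\mathbf{z}|
\le [\rho^2(A) + O(\delta)]\,\mathbf{z}^*\mathbf{z},$$

since

$$|\mathbf{z}^*\mathcal{M}(\delta)\mathbf{z}| \le n^2\,\mathbf{z}^*\mathbf{z}\,O(\delta)
= \mathbf{z}^*\mathbf{z}\,O(\delta).$$

Recalling $\|\mathbf{y}\| = N_2(\mathbf{z})$, we find from
$\|\mathbf{y}\| = 1$ that $\mathbf{z}^*\mathbf{z} = 1$. Then it follows that

$$\|A\| \le [\rho^2(A) + O(\delta)]^{1/2} = \rho(A) + O(\delta).$$

For $\delta$ sufficiently small $O(\delta) < \epsilon$. ■

† A quantity, say $f$, is said to be $O(\delta)$, or briefly $f = O(\delta)$,
iff for some constants $K \ge 0$ and $\delta_0 > 0$,

$$|f| \le K|\delta|, \qquad \text{for } |\delta| \le \delta_0.$$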
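The effect of the diagonal scaling $D$ can be seen on a matrix that is already upper triangular, so that $P = I$. The following minimal sketch deliberately swaps in the maximum norm for the Euclidean norm used in the proof, because the induced norm is then just a row sum: the vector norm $N_\infty(D\mathbf{x})$ induces the matrix norm $\|DAD^{-1}\|_\infty$. The matrix, the values of $\delta$, and the helper names are assumptions.

```python
def norm_inf_mat(A):
    """Maximum absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

def scaled(A, delta):
    """C = D A D^{-1} with D = diag(1, delta^-1, ..., delta^(1-n)).

    Entry (i, j) of C is a_ij * delta^(j - i), so entries above the
    diagonal shrink as delta decreases.
    """
    n = len(A)
    return [[A[i][j] * delta ** (j - i) for j in range(n)] for i in range(n)]

A = [[0.5, 10.0],
     [0.0, 0.5]]   # already upper triangular, so rho(A) = 0.5

for delta in (1.0, 0.1, 0.01, 0.001):
    # Norm of A induced by the vector norm N_inf(D x).
    print(delta, norm_inf_mat(scaled(A, delta)))
```

With $\delta = 1$ this is the ordinary maximum norm, far above the spectral radius; as $\delta$ decreases the induced norm approaches $\rho(A) = 0.5$ from above, in line with the theorem.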
It should be observed that the natural norm employed in Theorem 3
depends upon the matrix $A$ as well as the arbitrary small parameter
$\epsilon$. However, this result leads to an interesting characterization of
the spectral radius of any matrix; namely,

Corollary. For any square matrix $A$

$$\rho(A) = \inf_{\{N(\cdot)\}} \Bigl(\max_{N(\mathbf{x}) = 1} N(A\mathbf{x})\Bigr)$$

where the inf is taken over all vector norms, $N(\cdot)$; or equivalently

$$\rho(A) = \inf_{\{\|\cdot\|\}} \|A\|$$

where the inf is taken over all natural norms, $\|\cdot\|$.

Proof. By using Lemma 2 and Theorem 3, since $\epsilon > 0$ is arbitrary and
the natural norm there depends upon $\epsilon$, the result follows from the
definition of inf. ■

1.1. Convergent Matrices 

To study the convergence of various iteration procedures as well as 
for many other purposes, we investigate matrices A for which 

$$\lim_{m \to \infty} A^m = O, \tag{14}$$

where $O$ denotes the zero matrix all of whose entries are 0. Any square
matrix satisfying condition (14) is said to be convergent. Equivalent
conditions are contained in

Theorem 4. The following three statements are equivalent:

(a) $A$ is convergent;

(b) $\lim_{m \to \infty} \|A^m\| = 0$, for some matrix norm;

(c) $\rho(A) < 1$.

Proof. We first show that (a) and (b) are equivalent. Since $\|\cdot\|$ is
continuous, by Lemma 1', and $\|O\| = 0$, then (a) implies (b). But if (b)
holds for some norm, then Theorem 2' implies there exists an $M$ such that

$$\|A^m\|_\infty \le M\|A^m\| \to 0.$$

Hence, (a) holds.

Next we show that (b) and (c) are equivalent. Note that by Theorem 2'
there is no loss in generality if we assume the norm to be a natural norm.
But then, by Lemma 2 and the fact that $\lambda(A^m) = \lambda^m(A)$, we have

$$\|A^m\| \ge \rho(A^m) = \rho^m(A),$$

so that (b) implies (c). On the other hand, if (c) holds, then by Theorem 3
we can find an $\epsilon > 0$ and a natural norm, say $N(\cdot)$, such that

$$N(A) \le \rho(A) + \epsilon \equiv \theta < 1.$$

Now use the property (iv) of matrix norms to get

$$N(A^m) \le [N(A)]^m \le \theta^m$$

so that $\lim_{m \to \infty} N(A^m) = 0$ and hence (b) holds. ■

A test for convergent matrices which is frequently easy to apply is the
content of the

Corollary. $A$ is convergent if for some matrix norm

$$\|A\| < 1.$$

Proof. Again by (iv) we have

$$\|A^m\| \le \|A\|^m$$

so that condition (b) of Theorem 4 holds. ■
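The corollary gives a sufficient test only. A matrix can be convergent although a particular norm of it exceeds 1, as the following minimal sketch suggests: the illustrative matrix below has $\rho(A) = 0.5$ but $\|A\|_\infty = 10.5$, and its powers still tend to the zero matrix, as Theorem 4 predicts. The matrix, the number of powers, and the helper names are assumptions.

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def norm_inf_mat(A):
    return max(sum(abs(a) for a in row) for row in A)

A = [[0.5, 10.0],
     [0.0, 0.5]]           # rho(A) = 0.5, but the maximum norm is 10.5

power = [[1.0, 0.0],
         [0.0, 1.0]]
norms = []                 # norms[m - 1] holds the maximum norm of A^m
for m in range(1, 81):
    power = matmul(power, A)
    norms.append(norm_inf_mat(power))

print(norms[0], norms[9], norms[-1])
```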

Another important characterization and property of convergent 
matrices is contained in 

Theorem 5. (a) The geometric series

$$I + A + A^2 + A^3 + \cdots$$

converges iff $A$ is convergent.

(b) If $A$ is convergent, then $I - A$ is non-singular and

$$(I - A)^{-1} = I + A + A^2 + A^3 + \cdots.$$

Proof. A necessary condition for the series in part (a) to converge is
that $\lim_{m \to \infty} A^m = O$, i.e., that $A$ be convergent. The
sufficiency will follow from part (b).

Let $A$ be convergent, whence by Theorem 4 we know that $\rho(A) < 1$.
Since the eigenvalues of $I - A$ are $1 - \lambda(A)$, it follows that
$\det(I - A) \ne 0$ and hence this matrix is non-singular. Now consider the
identity

$$(I - A)(I + A + A^2 + \cdots + A^m) = I - A^{m+1}$$

which is valid for all integers $m$. Since $A$ is convergent, the limit as
$m \to \infty$ of the right-hand side exists. The limit, after multiplying
both sides on the left by $(I - A)^{-1}$, yields

$$(I + A + A^2 + \cdots) = (I - A)^{-1}$$

and part (b) follows. ■
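Part (b) can be illustrated by summing the series numerically and multiplying the partial sum by $I - A$. This is a minimal sketch; the 2 x 2 matrix (with $\rho(A) = 0.75$), the number of terms, and the helper names are assumptions.

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matadd(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

A = [[0.5, 0.25],
     [0.25, 0.5]]          # eigenvalues 0.75 and 0.25, so A is convergent
I = [[1.0, 0.0],
     [0.0, 1.0]]

# Partial sum S = I + A + A^2 + ... + A^199.
S, term = I, I
for _ in range(199):
    term = matmul(term, A)
    S = matadd(S, term)

I_minus_A = [[I[i][j] - A[i][j] for j in range(2)] for i in range(2)]
check = matmul(I_minus_A, S)   # close to the identity matrix
print(S)
print(check)
```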
A useful corollary to this theorem is

Corollary. If in some natural norm, $\|A\| < 1$, then $I - A$ is
non-singular and

$$\frac{1}{1 + \|A\|} \le \|(I - A)^{-1}\| \le \frac{1}{1 - \|A\|}.$$

Proof. By the corollary to Theorem 4 and part (b) of Theorem 5 it
follows that $I - A$ is non-singular. For a natural norm we note that
$\|I\| = 1$ and so taking the norm of the identity

$$I = (I - A)(I - A)^{-1}$$

yields

$$1 \le \|I - A\| \cdot \|(I - A)^{-1}\|
\le (1 + \|A\|)\,\|(I - A)^{-1}\|.$$

Thus the left-hand inequality is established.

Now write the identity as

$$(I - A)^{-1} = I + A(I - A)^{-1}$$

and take the norm to get

$$\|(I - A)^{-1}\| \le 1 + \|A\| \cdot \|(I - A)^{-1}\|.$$

Since $\|A\| < 1$ this yields

$$\|(I - A)^{-1}\| \le \frac{1}{1 - \|A\|}. \quad ■$$

It should be observed that if $A$ is convergent, so is $(-A)$, and
$\|A\| = \|-A\|$. Thus Theorem 5 and its corollary are immediately applicable
to matrices of the form $I + A$. That is, if in some natural norm,
$\|A\| < 1$, then

$$\frac{1}{1 + \|A\|} \le \|(I + A)^{-1}\| \le \frac{1}{1 - \|A\|}.$$

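The two-sided bound of the corollary can be checked in the maximum norm for a small example, for both $I - A$ and $I + A$. This is a minimal sketch; the matrix (with $\|A\|_\infty = 0.5$) and the helper names, including the explicit 2 x 2 inverse, are assumptions made for illustration.

```python
def norm_inf_mat(A):
    return max(sum(abs(a) for a in row) for row in A)

def inverse_2x2(M):
    """Inverse of a non-singular 2x2 matrix by the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det],
            [-c / det, a / det]]

A = [[0.2, -0.3],
     [0.1, 0.4]]           # maximum norm 0.5, so the corollary applies

norm_A = norm_inf_mat(A)
lower = 1 / (1 + norm_A)
upper = 1 / (1 - norm_A)

I_minus_A = [[1 - A[0][0], -A[0][1]], [-A[1][0], 1 - A[1][1]]]
I_plus_A = [[1 + A[0][0], A[0][1]], [A[1][0], 1 + A[1][1]]]

norm_inv_minus = norm_inf_mat(inverse_2x2(I_minus_A))
norm_inv_plus = norm_inf_mat(inverse_2x2(I_plus_A))
print(lower, norm_inv_minus, norm_inv_plus, upper)
```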

PROBLEMS, SECTION 1 

1. (a) Verify that (1b) defines a norm in the linear space of square matrices
of order $n$; i.e., check properties (i)-(iv), for
$\|A\|_E^2 \equiv \sum_{i,j} |a_{ij}|^2$.

(b) Similarly, verify that (1a) defines a matrix norm, i.e.,
$\|A\| \equiv \sum_{i,j} |a_{ij}|$.

2. Show by example that the maximum vector norm,
$\eta(A) \equiv \max_{i,j} |a_{ij}|$, when applied to a matrix, does not
satisfy condition (iv) that we impose on a matrix norm.

3. Show that if $A$ is non-singular, then $B \equiv A^*A$ is Hermitian and
positive definite. That is, $\mathbf{x}^*B\mathbf{x} > 0$ if
$\mathbf{x} \ne \mathbf{o}$. Hence the eigenvalues of $B$ are all positive.

4. Show for any non-singular matrix $A$ and any matrix norm that

$$\|I\| \ge 1 \quad \text{and} \quad \|A^{-1}\| \ge \frac{1}{\|A\|}.$$

[Hint: $\|I\| = \|II\| \le \|I\|^2$;
$\|A^{-1}A\| \le \|A^{-1}\| \cdot \|A\|$.]

5. Show that if $\eta(\mathbf{x})$ is a norm and $A$ is any non-singular
matrix, then $N(\mathbf{x})$ defined by

$$N(\mathbf{x}) \equiv \eta(A\mathbf{x})$$

is a (vector) norm.

6. We call $\eta(\mathbf{x})$ a semi-norm iff $\eta(\mathbf{x})$ satisfies
all of the conditions, (0)-(iii), for a norm with condition (i) replaced by
the weaker condition

(i'): $\eta(\mathbf{x}) \ge 0$ for all $\mathbf{x} \in \mathcal{V}$.

We say that $\eta(\mathbf{x})$ is non-trivial iff $\eta(\mathbf{x}) > 0$ for
some $\mathbf{x} \in \mathcal{V}$. Prove the following generalization of
Lemma 1:

Lemma 1''. Every non-trivial semi-norm, $\eta(\mathbf{x})$, is a continuous
function of $x_1, x_2, \ldots, x_n$, the components of $\mathbf{x}$. Hence
every semi-norm is continuous.

7. Show that if $\eta(\mathbf{x})$ is a semi-norm and $A$ any square matrix,
then $N(\mathbf{x}) \equiv \eta(A\mathbf{x})$ defines a semi-norm.


2. FLOATING-POINT ARITHMETIC AND ROUNDING ERRORS 

In the following chapters we will have to refer, on occasion, to the errors 
due to “rounding” in the basic arithmetic operations. Such errors are 
inherent in all computations in which only a fixed number of digits are 
retained. This is, of course, the case with all modern digital computers and 
we consider here an example of one way in which many of them do or 
can do arithmetic; so-called floating-point arithmetic. Although most 
electronic computers operate with numbers in some kind of binary 
representation, most humans still think in terms of a decimal representation 
and so we shall employ the latter here. 

Suppose the number $a \neq 0$ has the exact decimal representation

(1) $a = \pm 10^q (.d_1 d_2 \cdots),$

where q is an integer and the $d_1, d_2, \ldots$ are digits with $d_1 \neq 0$. Then
the “t-digit floating-decimal representation of a,” or for brevity the
“floating a” used in the machine, is of the form

(2) $\mathrm{fl}(a) = \pm 10^q (.\delta_1 \delta_2 \cdots \delta_t)$

where $\delta_1 \neq 0$ and $\delta_2, \ldots, \delta_t$ are digits. The number $(.\delta_1 \delta_2 \cdots \delta_t)$ is
called the mantissa and q is called the exponent of fl(a). There is usually a
restriction on the exponent, of the form

(3) $-N \le q \le M,$

for some large positive integers N, M. If a number $a \neq 0$ has an exponent
outside of this range it cannot be represented in the form (2), (3). If,
during the course of a calculation, some computed quantity has an exponent
$q > M$ (called overflow) or $q < -N$ (called underflow), meaningless
results usually follow. However, special precautions can be taken on most
computers to at least detect the occurrence of such over- or underflows.
We do not consider these practical difficulties further; rather, we shall
assume that they do not occur or are somehow taken into account.

There are two popular ways in which the floating digits $\delta_j$ are obtained
from the exact digits, $d_j$. The obvious chopping representation takes

(4) $\delta_j = d_j, \qquad j = 1, 2, \ldots, t.$

Thus the exact mantissa is chopped off after the tth decimal digit to get the
floating mantissa. The other and preferable procedure is to round, in
which case†

(5) $\delta_1 \delta_2 \cdots \delta_t = [d_1 d_2 \cdots d_t . d_{t+1} + 0.5]$

and the brackets on the right-hand side indicate the integral part. The
error in either of these procedures can be bounded as in

lemma 1. The error in t-digit floating-decimal representation of a number
$a \neq 0$ is bounded by

$$|a - \mathrm{fl}(a)| \le 5|a|\,10^{-t} p, \qquad p = \begin{cases} 1, & \text{rounded}, \\ 2, & \text{chopped}. \end{cases}$$


Proof. From (1), (2), and (4) we have

$$|a - \mathrm{fl}(a)| = 10^{q-t}(.d_{t+1} d_{t+2} \cdots)
= 10^{q-t}\,\frac{(.d_{t+1} d_{t+2} \cdots)}{|a|}\,|a|
= 10^{-t}\,\frac{(.d_{t+1} d_{t+2} \cdots)}{(.d_1 d_2 \cdots)}\,|a|.$$


† For simplicity we are neglecting the special case that occurs when
$d_1 = d_2 = \cdots = d_t = 9$ and $d_{t+1} \ge 5$. Here we would increase the exponent q in (2)
by unity and set $\delta_1 = 1$, $\delta_j = 0$, $j > 1$. Note that when $d_{t+1} = 5$, if we were to
round up iff $d_t$ is odd, then an unbiased rounding procedure would result. Some
electronic computers employ an unbiased rounding procedure (in a binary system).
But since $1 \le d_1 \le 9$ and $0.d_{t+1} d_{t+2} \cdots \le 1$ this implies

$$|a - \mathrm{fl}(a)| \le 10^{1-t} |a|,$$

which is the bound for the chopped representation. For the case of rounding
we have, similarly,

$$|a - \mathrm{fl}(a)| \le \tfrac{1}{2}\, 10^{q-t} = \tfrac{1}{2}\, 10^{q-t}\, \frac{|a|}{|a|} \le 5|a|\,10^{-t}. \quad ■$$
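The chopping rule (4), the rounding rule (5) and the bound of Lemma 1 can be tried out with exact rational arithmetic. The sketch below is only an illustration: the function name `fl`, its `mode` argument and the sample values are assumptions of this example, and the special case mentioned in the footnote (all retained digits equal to 9) is handled simply by letting the mantissa carry.

```python
import math
from fractions import Fraction

def fl(a, t, mode="round"):
    """t-digit floating-decimal representation of a non-zero rational a.

    |a| is written as 10**q * m with 0.1 <= m < 1; the mantissa m is then
    cut to t digits by chopping, or by adding 0.5 in the (t+1)st place and
    taking the integral part, as in (5).
    """
    a = Fraction(a)
    if a == 0:
        raise ValueError("the representation (1) assumes a != 0")
    sign = 1 if a > 0 else -1
    m, q = abs(a), 0
    while m >= 1:                      # normalise so that d_1 != 0
        m /= 10
        q += 1
    while m < Fraction(1, 10):
        m *= 10
        q -= 1
    scaled = m * 10**t                 # d_1 d_2 ... d_t . d_{t+1} ...
    if mode == "chop":
        digits = math.floor(scaled)
    else:
        digits = math.floor(scaled + Fraction(1, 2))
    return sign * Fraction(digits, 10**t) * Fraction(10) ** q

def relative_error(a, t, mode):
    """|a - fl(a)| / |a| computed exactly."""
    a = Fraction(a)
    return abs(a - fl(a, t, mode)) / abs(a)

t = 4
for value in (Fraction(2, 3), Fraction(-1, 7), Fraction("12345.678")):
    for mode, p in (("round", 1), ("chop", 2)):
        # Lemma 1: |a - fl(a)| <= 5 |a| 10^{-t} p
        assert relative_error(value, t, mode) <= 5 * p * Fraction(10) ** (-t)

print(fl(Fraction(2, 3), t, "round"), fl(Fraction(2, 3), t, "chop"))
```

Fractions are used so that the comparison with the bound is itself free of rounding error.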

We shall assume that our idealized computer performs each basic
arithmetic operation correctly to 2t digits and then either rounds or chops
the result to a t-digit floating number. With such operations it clearly
follows from Lemma 1 that

(6a) $\mathrm{fl}(a \pm b) = (a \pm b)(1 + \phi 10^{-t}),$
(6b) $\mathrm{fl}(ab) = a \cdot b\,(1 + \phi 10^{-t}),$
(6c) $\mathrm{fl}(a/b) = (a/b)(1 + \phi 10^{-t}),$

with $0 \le |\phi| \le 5$ for rounding and $0 \le |\phi| \le 10$ for chopping.


In many calculations, particularly those concerned with linear systems, 
the accumulation of products is required (e.g., the inner product of two 
vectors). We assume that rounding (or chopping) is done after each 
multiplication and after each successive addition. That is, 


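(7a) $\mathrm{fl}(a_1 b_1 + a_2 b_2) = [a_1 b_1 (1 + \phi_1 10^{-t}) + a_2 b_2 (1 + \phi_2 10^{-t})](1 + \theta 10^{-t})$

and in general

(7b) $\mathrm{fl}\Bigl(\sum_{i=1}^{n} a_i b_i\Bigr) = \mathrm{fl}\Bigl[\mathrm{fl}\Bigl(\sum_{i=1}^{n-1} a_i b_i\Bigr) + \mathrm{fl}(a_n b_n)\Bigr].$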

The result of such computations can be represented as an exact inner
product with, say, the $a_i$ slightly altered. We state this as

lemma 2. Let the floating-point inner product (7) be computed with rounding.
Then if n and t satisfy

(8) $n 10^{1-t} \le 1$

it follows that

(9a) $\mathrm{fl}\Bigl(\sum_{i=1}^{n} a_i b_i\Bigr) = \sum_{i=1}^{n} (a_i + \delta a_i) b_i$

where

(9b) $|\delta a_1| \le n |a_1| 10^{1-t}, \qquad |\delta a_i| \le (n - i + 2)|a_i| 10^{1-t}, \quad i = 2, 3, \ldots, n.$
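Proof. By (6b) we can write

$$\mathrm{fl}(a_1 b_1) = a_1 b_1 (1 + \phi_1 10^{-t}), \qquad |\phi_1| \le 5,$$

since rounding is assumed. Similarly from (6a) and (7b) with n = k we
have

$$\mathrm{fl}\Bigl(\sum_{i=1}^{k} a_i b_i\Bigr) = \Bigl[\mathrm{fl}\Bigl(\sum_{i=1}^{k-1} a_i b_i\Bigr) + a_k b_k (1 + \phi_k 10^{-t})\Bigr](1 + \theta_k 10^{-t})$$

where

$$\theta_1 = 0; \qquad |\theta_k| \le 5, \quad |\phi_k| \le 5, \qquad k = 2, 3, \ldots.$$

Now a simple recursive application of the above yields

$$\mathrm{fl}\Bigl(\sum_{i=1}^{n} a_i b_i\Bigr) = \sum_{k=1}^{n} \Bigl[a_k b_k (1 + \phi_k 10^{-t}) \prod_{j=k}^{n} (1 + \theta_j 10^{-t})\Bigr] \equiv \sum_{k=1}^{n} a_k b_k (1 + E_k),$$

where we have introduced $E_k$ by

$$1 + E_k \equiv (1 + \phi_k 10^{-t}) \prod_{j=k}^{n} (1 + \theta_j 10^{-t}).$$

A formal verification of this result is easily obtained by induction.

Since $\theta_1 = 0$, it follows that

$$(1 - 5 \cdot 10^{-t})^{n-k+2} \le 1 + E_k \le (1 + 5 \cdot 10^{-t})^{n-k+2}, \qquad k = 2, 3, \ldots, n,$$

and

$$(1 - 5 \cdot 10^{-t})^{n} \le 1 + E_1 \le (1 + 5 \cdot 10^{-t})^{n}.$$

Hence, with $\epsilon = 5 \cdot 10^{-t}$,

$$|E_1| \le (1 + \epsilon)^{n} - 1,$$

$$|E_k| \le (1 + \epsilon)^{n-k+2} - 1, \qquad k = 2, 3, \ldots, n.$$

But, for $p \le n$, (8) implies that $p\epsilon \le \tfrac{1}{2}$ so that

$$(1 + \epsilon)^p - 1 \le p\epsilon \bigl(1 + \tfrac{1}{2} + (\tfrac{1}{2})^2 + \cdots\bigr) \le 2p\epsilon = p\,10^{1-t}.$$

Therefore,

$$|E_k| \le (n - k + 2) 10^{1-t}, \qquad k = 2, 3, \ldots, n.$$

Clearly for k = 1 we find, as above with k = 2, that

$$|E_1| \le n \cdot 10^{1-t}.$$

The result now follows upon setting

$$\delta a_k \equiv a_k E_k.$$

(Note that we could just as well have set $\delta b_k \equiv b_k E_k$.) ■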
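Lemma 2 can be checked on a small example by simulating the accumulation (7) in t-digit decimal arithmetic. The sketch below uses Python's `decimal` module, whose context operations return the exact result rounded to the context precision; this is taken here as a stand-in for the idealized computer, and the data vectors are arbitrary sample values.

```python
from decimal import Decimal, Context, ROUND_HALF_UP
from fractions import Fraction

def fl_inner_product(a, b, t):
    """Accumulate sum(a_i * b_i), rounding to t digits after every
    multiplication and after every addition, as in (7)."""
    ctx = Context(prec=t, rounding=ROUND_HALF_UP)
    total = ctx.multiply(a[0], b[0])
    for a_i, b_i in zip(a[1:], b[1:]):
        total = ctx.add(total, ctx.multiply(a_i, b_i))
    return total

t = 4
a = [Decimal("0.1234"), Decimal("-5.678"), Decimal("91.01"), Decimal("0.002345")]
b = [Decimal("3.141"), Decimal("2.718"), Decimal("-0.5772"), Decimal("1.414")]
n = len(a)
assert n * Fraction(10) ** (1 - t) <= 1            # condition (8)

computed = Fraction(fl_inner_product(a, b, t))
exact = sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))

# (9a), (9b) imply |computed - exact| <= sum_i w_i |a_i b_i| 10^{1-t},
# with w_1 = n and w_i = n - i + 2 for i >= 2.
weights = [n] + [n - i + 2 for i in range(2, n + 1)]
bound = sum(w * abs(Fraction(x) * Fraction(y))
            for w, x, y in zip(weights, a, b)) * Fraction(10) ** (1 - t)

print(float(computed), float(exact), float(bound))
assert abs(computed - exact) <= bound
```

The bound is usually quite pessimistic; the observed error in such examples tends to be far smaller than the right-hand side.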

Obviously a similar result can be obtained for the error due to chopping 
if condition (8) is strengthened slightly; see Problem 1. 


PROBLEMS, SECTION 2 

1. Determine the result analogous to Lemma 2, when “chopping” replaces
“rounding” in the statement.

[Hint: The factor $10^{1-t}$ need only be replaced by $2 \cdot 10^{1-t}$, throughout.]

2. (a) Find a representation for $\mathrm{fl}\bigl(\sum_{i=1}^{n} c_i\bigr)$.

(b) If $c_1 > c_2 > \cdots > c_n > 0$, in what order should $\mathrm{fl}\bigl(\sum_{i=1}^{n} c_i\bigr)$ be
calculated to minimize the effect of rounding?

3. What are the analogues of equations (6a, b, c) in the binary representation:

$$\mathrm{fl}(a) = \pm 2^q (.\delta_1 \delta_2 \cdots \delta_t)$$

where $\delta_1 = 1$ and $\delta_j = 0$ or 1?


3. WELL-POSED COMPUTATIONS 

Hadamard introduced the notion of well-posed or properly posed 
problems in the theory of partial differential equations (see Section 0 of 
Chapter 9). However, it seems that a related concept is quite useful in 
discussing computational problems of almost all kinds. We refer to this 
as the notion of a well-posed computing problem. 

First, we must clarify what is meant by a “computing problem” in 
general. Here we shall take it to mean an algorithm or equivalently: a set
of rules specifying the order and kind of arithmetic operations (i.e., rounding
rules) to be used on specified data. Such a computing problem may have
as its object, for example, the determination of the roots of a quadratic 
equation or of an approximation to the solution of a nonlinear partial 
differential equation. How any such rules are determined for a particular 
purpose need not concern us at present (this is, in fact, what much of the 
rest of this book is about). 
Suppose the specified data for some particular computing problem are
the quantities $a_1, a_2, \ldots, a_m$, which we denote as the m-dimensional vector
a. Then if the quantities to be computed are $x_1, x_2, \ldots, x_n$, we can write

(1) $x = f(a),$

where of course the n-dimensional function f( ) is determined by the rules.

Now we will define a computing problem to be well-posed iff the algorithm
meets three requirements. The first requirement is that a “solution,”
x, should exist for the given data, a. This is implied by the notation (1). 
However, if we recall that (1) represents the evaluation of some algorithm 
it would seem that a solution (i.e., a result of using the algorithm) must 
always exist. But this is not true, a trivial example being given by data 
that lead to a division by zero in the algorithm. (The algorithm in this 
case is not properly specified since it should have provided for such a 
possibility. If it did not, then the corresponding computing problem is 
not well-posed for data that lead to this difficulty.) There are other, more 
subtle situations that result in algorithms which cannot be evaluated and 
it is by no means easy, a priori, to determine that x is indeed defined by (1). 

The second requirement is that the computation be unique. That is, 
when performed several times (with the same data) identical results are 
obtained. This is quite invariably true of algorithms which can be evaluated. 
If in actual practice it seems to be violated, the trouble usually lies with 
faulty calculations (i.e., machine errors). The functions f(a) must be 
single valued to insure uniqueness. 

The third requirement is that the result of the computation should
depend Lipschitz continuously on the data with a constant that is not too
large. That is, “small” changes in the data, a, should result in only
“small” changes in the computed x. For example, let the computation
represented by (1) satisfy the first two requirements for all data a in some
set, say $a \in D$. If we change the data a by a small amount $\delta a$ so that
$(a + \delta a) \in D$, then we can write the result of the computation with the
altered data as

(2) $x + \delta x = f(a + \delta a).$

Now if there exists a constant M such that for any $\delta a$,

(3) $\|\delta x\| \le M \|\delta a\|,$

we say that the computation depends Lipschitz continuously on the data.
Finally, we say (1) is well-posed iff the three requirements are satisfied and
(3) holds with a not too large constant, $M = M(a, \eta)$, for some not too
small $\eta > 0$ and all $\delta a$ such that $\|\delta a\| \le \eta$. Since the Lipschitz constant
M depends on $(a, \eta)$ we see that a computing problem or algorithm may
be well-posed for some data, a, but not for all data.

Let $\mathcal{P}(a)$ denote the original problem which the algorithm (1) was devised
to “solve.” This problem is also said to be well-posed if it has a unique
solution, say

$$y = g(a),$$

which depends Lipschitz continuously on the data. That is, $\mathcal{P}(a)$ is well-posed
if for all $\delta a$ satisfying $\|\delta a\| \le \zeta$, there is a constant $N = N(a, \zeta)$
such that

(4) $\|g(a + \delta a) - g(a)\| \le N \|\delta a\|.$

We call the algorithm (1) convergent iff f depends on a parameter, say $\epsilon$
(e.g., $\epsilon$ may determine the size of the rounding errors), so that for any
small $\epsilon > 0$,

(5) $\|f(a + \delta a) - g(a + \delta a)\| \le \epsilon,$

for all $\delta a$ such that $\|\delta a\| \le \delta$. Now, if $\mathcal{P}(a)$ is well-posed and (1) is convergent,
then (4) and (5) yield

(6) $$\begin{aligned} \|f(a) - f(a + \delta a)\| &\le \|f(a) - g(a)\| + \|g(a) - g(a + \delta a)\| + \|g(a + \delta a) - f(a + \delta a)\| \\ &\le \epsilon + N \|\delta a\| + \epsilon. \end{aligned}$$

Thus, recalling (3), we are led to the heuristic

observation 1. If $\mathcal{P}$ is a well-posed problem, then a necessary condition
that (1) be a convergent algorithm is that (1) be a well-posed computation.

Therefore we are interested in determining whether a given algorithm
(1) is a well-posed computation simply because only such an algorithm
is sure to be convergent for all problems of the form $\mathcal{P}(a + \delta a)$, when
$\mathcal{P}(a)$ is well-posed and $\|\delta a\| \le \delta$.

Similarly, by interchanging f and g in (6), we may justify

observation 2. If $\mathcal{P}$ is a not well-posed problem, then a necessary condition
that (1) be an accurate algorithm is that (1) be a not well-posed computation.

In fact, for certain problems of linear algebra (see Subsection 1.2 of
Chapter 2), it has been possible to prove that the commonly used algorithms,
(1), produce approximations, x, which are exact solutions of
slightly perturbed original mathematical problems. In these algebraic
cases, the accuracy of the solution x, as measured in (5), is seen to depend
on the well-posedness of the original mathematical problem. In algorithms,
(1), that arise from differential equation problems, other techniques are
developed to estimate the accuracy of the approximation. For differential
equation problems the well-posedness of the resulting algorithms (1) is
referred to as the stability of the finite difference schemes (see Chapters 8
and 9).

We now consider two elementary examples to illustrate some of the 
previous notions. 

The most overworked example of how a simple change in the algorithm
can affect the accuracy of a single precision calculation is the case of
determining the smallest root of a quadratic equation. If in

$$x^2 + 2bx + c = 0,$$

$b < 0$ and c are given to t digits, but $|c|/b^2 < 10^{-t}$, then the smallest
root, $x_2$, should be found from $x_2 = c/x_1$, after finding $x_1 = -b + \sqrt{b^2 - c}$
in single precision arithmetic. Using

$$x_2 = -b - \sqrt{b^2 - c}$$

in single precision arithmetic would be disastrous!
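The effect is easy to reproduce in simulated t-digit decimal arithmetic. In the sketch below the precision t = 8 and the data b = -10^5, c = 1 are assumptions chosen so that $|c|/b^2 < 10^{-t}$; the `decimal` context plays the role of the single precision machine.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

def quadratic_roots(b, c, t):
    """Roots of x**2 + 2*b*x + c = 0 (with b < 0) in t-digit arithmetic.

    Returns the large root x1, the small root from the cancelling formula
    -b - sqrt(b*b - c), and the small root from x2 = c / x1.
    """
    ctx = Context(prec=t, rounding=ROUND_HALF_UP)
    disc = ctx.subtract(ctx.multiply(b, b), c)
    root = ctx.sqrt(disc)
    x1 = ctx.add(ctx.minus(b), root)             # no cancellation since -b > 0
    x2_naive = ctx.subtract(ctx.minus(b), root)  # difference of nearly equal numbers
    x2_stable = ctx.divide(c, x1)
    return x1, x2_naive, x2_stable

x1, x2_naive, x2_stable = quadratic_roots(Decimal(-100000), Decimal(1), 8)
print(x1, x2_naive, x2_stable)
```

With these data the cancelling formula loses every significant digit of the small root, whereas $c/x_1$ agrees with the true value, which is close to $5 \cdot 10^{-6}$, to the working precision.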

A more sophisticated well-posedness discussion, without reference to
the type of arithmetic, is afforded by the problem of determining the zeros
of a polynomial

$$P_n(z) = z^n + a_{n-1} z^{n-1} + \cdots + a_1 z + a_0.$$

If $Q_n(z) \equiv z^n + b_{n-1} z^{n-1} + \cdots + b_1 z + b_0$, then the zeros of
$P_n(z; \epsilon) \equiv P_n(z) + \epsilon Q_n(z)$ are “close” to the zeros of $P_n(z)$. That is, in the
theory of functions of a complex variable it is shown that

lemma. If $z = z_1$ is a simple zero of $P_n(z)$, then for $|\epsilon|$ sufficiently small
$P_n(z; \epsilon)$ has a zero $z_1(\epsilon)$, such that

$$\left| z_1(\epsilon) - z_1 + \epsilon\, \frac{Q_n(z_1)}{P_n'(z_1)} \right| = \mathcal{O}(\epsilon^2).$$

If $z_1$ is a zero of multiplicity r of $P_n(z)$, there are r neighboring zeros of $P_n(z; \epsilon)$
with

$$\left| z_1(\epsilon) - z_1 - \left[ -\epsilon\, \frac{r!\, Q_n(z_1)}{P_n^{(r)}(z_1)} \right]^{1/r} \right| = \mathcal{O}(\epsilon^{2/r}).$$


Now it is clear that in the case of a simple zero, $z_1$, the computing problem,
to determine the zero, might be well-posed if $P_n'(z_1)$ were not too
small and $Q_n(z_1)$ not too large, since then $|z_1(\epsilon) - z_1|/|\epsilon|$ would not be
large for small $\epsilon$. On the other hand, the determination of the multiple
root would most likely lead to a not well-posed computing problem.
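The contrast between a simple and a multiple zero can be seen already for quadratics. The sketch below perturbs two polynomials by $\epsilon Q(z)$ with $Q(z) = z^2$; the polynomials, the value of $\epsilon$ and the helper name are assumptions of this example only.

```python
import cmath

def zeros_of_quadratic(a, b, c):
    """Both zeros of a*z**2 + b*z + c (a != 0), returned as complex numbers."""
    d = cmath.sqrt(b * b - 4 * a * c)
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

eps = 1e-8

# Simple zero z1 = 1 of P(z) = (z - 1)(z - 2) = z**2 - 3z + 2.
# P + eps*Q = (1 + eps) z**2 - 3z + 2; the shift is about eps*|Q(1)/P'(1)| = eps.
simple = zeros_of_quadratic(1 + eps, -3, 2)
shift_simple = min(abs(z - 1) for z in simple)

# Double zero z1 = 1 of P(z) = (z - 1)**2 = z**2 - 2z + 1.
# P + eps*Q = (1 + eps) z**2 - 2z + 1; the shift is about sqrt(eps).
double = zeros_of_quadratic(1 + eps, -2, 1)
shift_double = min(abs(z - 1) for z in double)

print(shift_simple, shift_double)
```

A perturbation of relative size $10^{-8}$ moves the simple zero by about $10^{-8}$ but the double zero by about $10^{-4}$, so half of the significant digits of the data are lost in the second case.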
The latter example illustrates Observation 2, that is, a computing
problem is not well-posed if the original mathematical problem is not
well-posed. On the other hand, the example of the quadratic equation
indicates how an ill-chosen formulation of an algorithm may be well-posed
but yet inaccurate in single precision.

Given an $\epsilon > 0$ and a problem $\mathcal{P}(a)$ we do not, in general, know how
to determine an algorithm, (1), that requires the least amount of work
to find x so that $\|x - y\| \le \epsilon$. This is an important aspect of algorithms
for which there is no general mathematical theory. For most of the algorithms
that are described in later chapters, we estimate the number of
arithmetic operations required to find x.


PROBLEM, SECTION 3 

1. For the quadratic equation 

$$x^2 + 2bx + c = 0,$$

find the small root by using single precision arithmetic in the iterative schemes 

(a) x ^ = -Y b ~W 

and 


(b) $x_{n+1} = x_n - \dfrac{x_n^2 + 2bx_n + c}{2x_n + 2b}.$

If your computer has a mantissa with approximately $t = 2p$ digits, use

$$c = 1, \qquad b = -10^p$$


for the two initial values 


(i) x 0 = 0; (ii) *q = -- 


Which scheme gives the smaller root to approximately t digits with the smaller 
number of iterations? Which scheme requires less work? 




2 


Numerical Solution of Linear Systems and Matrix Inversion


0. INTRODUCTION 

Finding the solution of a linear algebraic equation system of “large” 
order and calculating the inverse of a matrix of “large” order can be 
difficult numerical tasks. While in principle there are standard methods 
for solving such problems, the difficulties are practical and stem from 

(a) the labor required in a lengthy sequence of calculations, 
and 

(b) the possible loss of accuracy in such lengthy calculations performed 
with a fixed number of decimal places. 

The first difficulty renders manual computation impractical and the second 
limits the applicability of high speed digital computers with fixed “word” 
length. Thus to determine the feasibility of solving a particular problem 
with given equipment, several questions should be answered: 

(i) How many arithmetic operations are required to apply a proposed 
method? 

(ii) What will be the accuracy of a solution to be found by the proposed
method (a priori estimate)?

(iii) How can the accuracy of the computed answer be checked (a
posteriori estimate)?

The first question can frequently† be answered in a straightforward
manner and this is done, by means of an “operational count,” for most
of the methods in this chapter. The third question can be easily answered
if we have a bound for the norm of the inverse matrix. We therefore indicate,
in Subsection 1.3, how such a bound may be obtained if we have an
approximate inverse. However, the second question has only been recently
answered for some methods. After discussing the notions of “well-posed
problem” and “condition number” of a matrix, we give an account of
Wilkinson’s a priori estimate for the Gaussian elimination method in
Subsection 1.2. This treatment, in Section 1, of the Gaussian elimination
method is followed, in Section 2, by a discussion of some modifications
of the procedure. Direct factorization methods, which include Gaussian
elimination as a special case, are described in Section 3. Iterative methods
and techniques for accelerating them are studied in the remaining three
sections.

† For “direct” methods, the operational count is easily made; while for “indirect”
or iterative methods, the operational count is made by multiplying the estimated
number of iterations by the operational count per iteration.

The matrix inversion problem may be formulated as follows: Given a
square matrix of order n,

(1) $$A \equiv (a_{ij}) \equiv \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix},$$

find its inverse, i.e., a square matrix of order n, say $A^{-1}$, such that

(2) $A^{-1} A = A A^{-1} = I \equiv (\delta_{ij}).$

Here I is the nth order identity matrix whose elements are given by the
Kronecker delta:

(3) $$\delta_{ij} \equiv \begin{cases} 0, & \text{if } i \neq j; \\ 1, & \text{if } i = j. \end{cases}$$

It is well known that this problem has one and only one solution iff the
determinant of A is non-zero ($\det A \neq 0$), i.e., iff A is non-singular.

The problem of solving a general linear system is formulated as follows:
Given a square matrix A and an arbitrary n-component column vector
f, find a vector x which satisfies

(4a) $Ax = f,$

or, in component form,

(4b) $$\begin{aligned} a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n &= f_1, \\ a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n &= f_2, \\ &\;\;\vdots \\ a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n &= f_n. \end{aligned}$$
Again it is known that this problem has a solution which is unique for
every inhomogeneous term f, iff A is non-singular. [If A is singular the
system (4) will have a solution only for special vectors f and such a solution
is not unique. The numerical solution of such “singular” problems is
briefly touched on in Section 1 and in Problems 1.3, 1.4 of Chapter 4.]

It is easy to see that the problem of matrix inversion is equivalent to
that of solving linear systems. For, let the inverse of A, assumed non-singular,
be known and have elements $c_{ij}$, that is,

$$A^{-1} \equiv (c_{ij}).$$

Then multiplication of (4a) on the left by $A^{-1}$ yields, since $Ix = x$,

(5a) $x = A^{-1} f,$

or componentwise,

(5b) $x_i = \sum_{j=1}^{n} c_{ij} f_j, \qquad i = 1, 2, \ldots, n.$

Thus when the $c_{ij}$ are known it requires, at most, n multiplications and
$(n - 1)$ additions to evaluate each component of the solution, or, for the
complete solution, a total of $n^2$ multiplications and $n(n - 1)$ additions.

On the other hand, assume a procedure is known for solving the non-singular
system (4) with an arbitrary inhomogeneous term f. We then
consider the n special systems

(6) $Ax = e^{(j)}, \qquad j = 1, 2, \ldots, n,$

where $e^{(j)}$ is the jth column of the identity matrix; that is, the elements
of $e^{(j)}$ are $e_i^{(j)} \equiv \delta_{ij}$, $i = 1, 2, \ldots, n$. The solutions of these systems are n
vectors which we call $x^{(j)}$, $j = 1, 2, \ldots, n$; the components of $x^{(j)}$ are
denoted by $x_i^{(j)}$. With these vectors we form the square matrix

(7) $X \equiv (x_i^{(j)}),$

in which the jth column is the solution $x^{(j)}$ of the jth system in (6). Then
it follows from the row by column rule for matrix multiplication that

(8) $AX = (\delta_{ij}) = I.$

Since A was assumed to be non-singular, we find upon multiplying both
sides of (8) on the left by $A^{-1}$ that

$$X = A^{-1}.$$

Thus, by solving the n special systems (6) the inverse may be computed;
this is the procedure generally used in practice. The number of operations
required is, at most, n times that required to solve a single system. However,
this number can be reduced by efficiently organizing the computations
and by taking account of the special form of the inhomogeneous terms,
$e^{(j)}$, as we shall explain later in Subsection 1.1.
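The construction (6)-(8) can be sketched as follows. The routine `solve` is only a stand-in for whatever procedure is available for the system (4); here it is a plain elimination in exact rational arithmetic, and both function names and the sample matrix are assumptions of this example. No attempt is made to exploit the special form of the right-hand sides.

```python
from fractions import Fraction

def solve(A, f):
    """Stand-in solver for the non-singular system Ax = f.

    Eliminates each variable from all other equations, using the first
    non-zero coefficient found as pivot (next() raises StopIteration if
    the matrix is singular).
    """
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(f_i)] for row, f_i in zip(A, f)]
    for k in range(n):
        p = next(i for i in range(k, n) if M[i][k] != 0)
        M[k], M[p] = M[p], M[k]
        for i in range(n):
            if i != k:
                m = M[i][k] / M[k][k]
                M[i] = [x - m * y for x, y in zip(M[i], M[k])]
    return [M[i][n] / M[i][i] for i in range(n)]

def inverse_by_columns(A):
    """Solve A x = e^(j) for j = 1, ..., n and use the solutions as columns of X."""
    n = len(A)
    columns = [solve(A, [1 if i == j else 0 for i in range(n)]) for j in range(n)]
    return [[columns[j][i] for j in range(n)] for i in range(n)]

A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
X = inverse_by_columns(A)
n = len(A)
product = [[sum(A[i][k] * X[k][j] for k in range(n)) for j in range(n)]
           for i in range(n)]
identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
assert product == identity          # this is relation (8), AX = I
```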


PROBLEMS, SECTION 0 


1. If the columns of A form a set of vectors such that at most c of the columns
are linearly independent, then we say that the column rank of A is c. (Similarly
define the row rank of A to be r by replacing “columns” by “rows” and
“c” by “r.”) Prove that the row rank of A equals the column rank of A.

[Hint: Let row rank (A) = r; and use a set of r rows of A to define a submatrix
B with row rank (B) = r. Then show that $c \equiv$ column rank (A) =
column rank (B). Hence, since B has r rows, $c \le r$. Similarly show that
$r \le c$.] Hence define rank of A by rank $(A) \equiv r = c$.

2. (Alternative Principle) If A is of order n, then either

$$Ax = o \quad \text{iff} \quad x = o,$$

or else

$$r \equiv \operatorname{rank}(A) < n$$

and there exist a finite number, p, of linearly independent solutions $\{x^{(j)}\}$
that span the null space of A, i.e.,

$$Ax^{(j)} = o, \qquad j = 1, 2, \ldots, p,$$

and if $Ax = o$ there exist constants $\{\alpha_j\}$ such that $x = \sum_{j=1}^{p} \alpha_j x^{(j)}$. Show that

3. Observe that

(9) $Ax = f$

has a solution iff f is a linear combination of the columns of A. Hence show that:
(9) has a solution x iff rank (A) = rank (A, f); (9) has a solution x iff $y^T f = 0$
for all vectors $y \neq o$ such that $y^T A = o$. (In this problem, A may be rectangular;
(A, f) is the augmented matrix.)


1. GAUSSIAN ELIMINATION 

The best known and most widely used method for solving linear systems
of algebraic equations and for inverting matrices is attributed to Gauss.
It is, basically, the elementary procedure in which the “first” equation is
used to eliminate the “first” variable from the last n - 1 equations, then
the new “second” equation is used to eliminate the “second” variable
from the last n - 2 equations, etc. If n - 1 such eliminations can be
performed, then the resulting linear system which is equivalent† to the
original system is triangular and is easily solved. Of course, the ordering
of the equations in the system and of the unknowns in the equations is
arbitrary and so there is no unique order in which the procedure must be
employed. As we shall see, the ordering is important since some orderings
may not permit n - 1 eliminations, while among the permissible orderings,
some are to be preferred since they yield more accurate results.

† Two linear systems are equivalent iff every solution of one is a solution of the
other.

In order to describe the specific sequence of arithmetic operations used
in Gaussian elimination, we will first use the natural order in which the
system is given, say

(1a) $Ax = f,$

where

(1b) $A \equiv (a_{ij}), \qquad f \equiv (f_1, f_2, \ldots, f_n)^T.$


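Before the variable $x_k$ is eliminated, we denote the equivalent system (i.e.,
the reduced system), from which $x_1, x_2, \ldots, x_{k-1}$ have been eliminated, by

(2a) $A^{(k)} x = f^{(k)}, \qquad k = 1, 2, \ldots, n,$

where

(2b) $A^{(k)} \equiv (a_{ij}^{(k)}), \qquad f^{(k)} \equiv (f_1^{(k)}, f_2^{(k)}, \ldots, f_n^{(k)})^T.$

For k = 1 we have $A^{(1)} \equiv A$, $f^{(1)} \equiv f$, and the elements in (2b) for
k = 2, 3, ..., n, are computed recursively by

(3a) $$a_{ij}^{(k)} = \begin{cases} a_{ij}^{(k-1)} & \text{for } i \le k - 1, \\ 0 & \text{for } i \ge k, \; j \le k - 1, \\ a_{ij}^{(k-1)} - \dfrac{a_{i,k-1}^{(k-1)}}{a_{k-1,k-1}^{(k-1)}}\, a_{k-1,j}^{(k-1)} & \text{for } i \ge k, \; j \ge k; \end{cases}$$

(3b) $$f_i^{(k)} = \begin{cases} f_i^{(k-1)} & \text{for } i \le k - 1, \\ f_i^{(k-1)} - \dfrac{a_{i,k-1}^{(k-1)}}{a_{k-1,k-1}^{(k-1)}}\, f_{k-1}^{(k-1)} & \text{for } i \ge k. \end{cases}$$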

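These formulae represent the result of multiplying the (k - 1)st equation
in $A^{(k-1)} x = f^{(k-1)}$ by the ratio $(a_{i,k-1}^{(k-1)} / a_{k-1,k-1}^{(k-1)})$ and subtracting the
result from the ith equation for all $i \ge k$. In this way, the variable $x_{k-1}$
is eliminated from the last n - k + 1 equations. The resulting coefficient
matrix and inhomogeneous term have the forms

(4) $$A^{(k)} = \begin{pmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & a_{13}^{(1)} & \cdots & a_{1,k-1}^{(1)} & a_{1k}^{(1)} & \cdots & a_{1n}^{(1)} \\
0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2,k-1}^{(2)} & a_{2k}^{(2)} & \cdots & a_{2n}^{(2)} \\
0 & 0 & a_{33}^{(3)} & \cdots & a_{3,k-1}^{(3)} & a_{3k}^{(3)} & \cdots & a_{3n}^{(3)} \\
\vdots & & & \ddots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & a_{k-1,k-1}^{(k-1)} & a_{k-1,k}^{(k-1)} & \cdots & a_{k-1,n}^{(k-1)} \\
0 & 0 & 0 & \cdots & 0 & a_{kk}^{(k)} & \cdots & a_{kn}^{(k)} \\
0 & 0 & 0 & \cdots & 0 & a_{k+1,k}^{(k)} & \cdots & a_{k+1,n}^{(k)} \\
\vdots & & & & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & 0 & a_{nk}^{(k)} & \cdots & a_{nn}^{(k)}
\end{pmatrix}, \qquad
f^{(k)} = \begin{pmatrix} f_1^{(1)} \\ f_2^{(2)} \\ f_3^{(3)} \\ \vdots \\ f_{k-1}^{(k-1)} \\ f_k^{(k)} \\ f_{k+1}^{(k)} \\ \vdots \\ f_n^{(k)} \end{pmatrix}.$$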
It has been assumed above that the elements $a_{kk}^{(k)} \neq 0$ for k = 1, 2, ..., n.
When this is the case we have

theorem 1. Let the matrix A be such that the Gaussian elimination
procedure defined in (2)-(3) (i.e., in the natural order) yields non-zero
diagonal elements $a_{kk}^{(k)}$, k = 1, 2, ..., n. Then A is non-singular and in fact,

(5a) $\det A = a_{11}^{(1)} a_{22}^{(2)} \cdots a_{nn}^{(n)}.$

The final matrix $A^{(n)} \equiv U$ is upper triangular and A has the factorization

(5b) $LU = A,$

where $L \equiv (m_{ik})$ is lower triangular with the elements

(5c) $$m_{ik} \equiv \begin{cases} 0 & \text{for } i < k, \\ 1 & \text{for } i = k, \\ a_{ik}^{(k)} / a_{kk}^{(k)} & \text{for } i > k. \end{cases}$$

The final vector $f^{(n)} \equiv g$ is

(5d) $g = L^{-1} f.$


Proof. Once (5b) is established, we have that $\det A = (\det L) \cdot (\det U) = \det U$
and so (5a) follows. To verify (5b), let us set $LU \equiv (c_{ij})$. Then,
since L and U are triangular and (4) is satisfied for k = n,

$$c_{ij} = \sum_{k=1}^{n} m_{ik} a_{kj}^{(n)} = \sum_{k=1}^{\min(i,j)} m_{ik} a_{kj}^{(k)}.$$

We recall that $a_{ij}^{(1)} = a_{ij}$ and note from (3a) and (5c) that

$$m_{i,k-1}\, a_{k-1,j}^{(k-1)} = a_{ij}^{(k-1)} - a_{ij}^{(k)} \qquad \text{for } 2 \le k \le i, \; k \le j.$$

Thus, if $i \le j$ we get from the above

$$c_{ij} = \sum_{k=1}^{i-1} m_{ik} a_{kj}^{(k)} + a_{ij}^{(i)}
= \sum_{k=1}^{i-1} \bigl(a_{ij}^{(k)} - a_{ij}^{(k+1)}\bigr) + a_{ij}^{(i)}
= a_{ij}.$$

This holds also for $i > j$ since $a_{ij}^{(j+1)} = 0$ and so (5b) is verified.

Define $h \equiv Lg$ so that

$$h_i = \sum_{k=1}^{i} m_{ik} g_k = \sum_{k=1}^{i} m_{ik} f_k^{(k)}.$$

Now from (3b) and (5c)

$$m_{ik} f_k^{(k)} = f_i^{(k)} - f_i^{(k+1)} \qquad \text{for } k < i,$$

and $f_i^{(1)} = f_i$. Thus, we find $h_i = f_i$ and since L is non-singular (5d)
follows. ■
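The statements of Theorem 1 can be verified directly on a small example. The following sketch carries out the recursion (3a), (3b) in the natural order with exact rational arithmetic, stores the multipliers of (5c), and then checks (5a), (5b) and $Lg = f$. The function name and the sample data are assumptions of this example, and indices in the code start at 0 rather than 1.

```python
from fractions import Fraction

def eliminate_natural_order(A, f):
    """Gaussian elimination in the natural order.

    Returns L (unit lower triangular multipliers), U = A^(n) and g = f^(n).
    Raises ZeroDivisionError if a zero pivot is met.
    """
    n = len(A)
    U = [[Fraction(v) for v in row] for row in A]
    g = [Fraction(v) for v in f]
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for p in range(n - 1):                      # p: 0-based index of the pivot row
        if U[p][p] == 0:
            raise ZeroDivisionError("zero pivot: the natural order fails")
        for i in range(p + 1, n):
            m = U[i][p] / U[p][p]               # multiplier m_ip of (5c)
            L[i][p] = m
            U[i] = [u_ij - m * u_pj for u_ij, u_pj in zip(U[i], U[p])]
            g[i] -= m * g[p]
    return L, U, g

A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
f = [1, 2, 3]
L, U, g = eliminate_natural_order(A, f)
n = len(A)

LU = [[sum(L[i][k] * U[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
Lg = [sum(L[i][k] * g[k] for k in range(n)) for i in range(n)]
det_A = 1
for k in range(n):
    det_A *= U[k][k]                            # (5a): product of the pivots

assert LU == A and Lg == f
print(det_A)
```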

Under the conditions of this theorem, the system (1) can be written as

$$LUx = Lg.$$

Multiplication on the left by $L^{-1}$ yields the equivalent upper triangular
system

(6a) $Ux = g.$

If we write $U \equiv (u_{ij})$, then (6a) is easily solved in the order $x_n, x_{n-1}, \ldots, x_1$
to get

(6b) $x_i = \dfrac{1}{u_{ii}} \Bigl( g_i - \sum_{j=i+1}^{n} u_{ij} x_j \Bigr), \qquad i = n, n - 1, \ldots, 1.$

We recall that the elements of $U \equiv A^{(n)}$ and $g \equiv f^{(n)}$ are computed by the
Gaussian elimination procedure (3), without the explicit evaluation of $L^{-1}$.
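Solving the upper triangular system (6a) in the order $x_n, x_{n-1}, \ldots, x_1$ is commonly called back substitution. A minimal sketch, with a hypothetical function name and with sample data U and g taken as given, is:

```python
from fractions import Fraction

def back_substitute(U, g):
    """Solve Ux = g for upper triangular U with non-zero diagonal,
    computing x_n first and x_1 last."""
    n = len(U)
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):
        known = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (Fraction(g[i]) - known) / U[i][i]
    return x

U = [[2, 1, 1], [0, 1, 1], [0, 0, 2]]
g = [1, 0, -1]
x = back_substitute(U, g)
print(x)
```

Each component costs one division and at most n - 1 multiplications, which is the basis of the operational counts for triangular systems.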

We now consider the generalization in which the order of elimination
is arbitrary. Again we set $A^{(1)} \equiv A$ and $f^{(1)} \equiv f$. Then we select an arbitrary
non-zero element $a_{i_1 j_1}^{(1)}$, called the 1st pivot element. (If this cannot be done
then $A \equiv O$ and the system is degenerate, but also trivially in triangular
form.) Since $a_{i_1 j_1}^{(1)} \neq 0$ is the coefficient of the $j_1$st variable, $x_{j_1}$, in the $i_1$st
equation we can eliminate this variable from all of the other equations.
To do this, we subtract an appropriate unique multiple of the $i_1$st equation
from each of the other equations; i.e., to eliminate $x_{j_1}$ from the kth
equation the multiplier must be $m_{k j_1} = (a_{k j_1}^{(1)} / a_{i_1 j_1}^{(1)})$.

The reduced system is written as $A^{(2)} x = f^{(2)}$ and it is such that omitting
the $i_1$st equation yields n - 1 equations in the n - 1 unknowns $x_k$,
$k \neq j_1$. We now proceed with this reduced system and eliminate a second
unknown, say $x_{j_2}$. To do this we must find some element $a_{i_2 j_2}^{(2)} \neq 0$ with
$i_2 \neq i_1$ and $j_2 \neq j_1$, called the 2nd pivot element. If $a_{rs}^{(2)} = 0$ for all $r \neq i_1$
and $s \neq j_1$ the process is terminated as the remaining equations are
degenerate. After this second elimination the resulting system, say,
$A^{(3)} x = f^{(3)}$, is composed of the $i_1$st equation of $A^{(1)} x = f^{(1)}$, the $i_2$nd
equation of $A^{(2)} x = f^{(2)}$ and n - 2 remaining equations in only n - 2
variables, $x_k$ with $k \neq j_1, j_2$. The general process is now clear and can be
used to prove

theorem 2. Let the matrix A have rank r. Then we can find a sequence
of distinct row and column indices $(i_1, j_1), (i_2, j_2), \ldots, (i_r, j_r)$ such that the
corresponding pivot elements in $A^{(1)}, A^{(2)}, \ldots, A^{(r)}$ are non-zero and $a_{ij}^{(r)} = 0$
if $i \neq i_1, i_2, \ldots, i_r$. Let us define the permutation matrices, whose columns
are unit vectors,

$$P \equiv (e^{(i_1)}, e^{(i_2)}, \ldots, e^{(i_r)}, \ldots, e^{(i_n)}),$$

$$Q \equiv (e^{(j_1)}, e^{(j_2)}, \ldots, e^{(j_r)}, \ldots, e^{(j_n)}),$$

where $i_k, j_k$, for $1 \le k \le r$, are the above pivotal indices and the sets
$\{i_k\}$ and $\{j_k\}$ are permutations of 1, 2, ..., n.

Then the system

$$By = g,$$

where

$$B \equiv P^T A Q, \qquad y \equiv Q^T x, \qquad g \equiv P^T f,$$

is equivalent to the system (1) and can be reduced to triangular form by
using Gaussian elimination with the natural order of pivots (1, 1), (2, 2), ...,
(r, r).

Proof. The generalized elimination alters the matrix $A \equiv A^{(1)}$ by
forming successive linear combinations of the rows. Thus, whenever no
non-zero pivot elements can be found the null rows must have been linearly
dependent upon the rows containing non-zero pivots. The permutations
by P and Q simply arrange the order of the equations and unknowns,
respectively, so that $b_{\nu\nu} = a_{i_\nu j_\nu}$, $\nu = 1, 2, \ldots, n$. By the first part of the
theorem, the reduced matrix $B^{(r)}$ is triangular since all rows after the rth
one vanish. ■

If the matrix A is non-singular, then r = n and Theorem 2 implies that,
after the indicated rearrangement of the data, Theorem 1 becomes applicable.
This is only useful for purposes of analysis. In actual computations
on digital computers it is a simple matter to record the order of the pivot
indices $(i_\nu, j_\nu)$ for $\nu = 1, 2, \ldots, n$ and to do the arithmetic accordingly.
Of course, the important problem is to determine some order for the
pivots so that the elimination can be completed.

One way to pick the pivots is to require that $(i_k, j_k)$ be the indices of a 
maximal coefficient in the system of $n - k + 1$ equations that remain at 
the kth step. This method of selecting maximal pivots is recommended as 
being likely to introduce the least loss of accuracy in the arithmetical 
operations that are based on working with a finite number of digits. 
We shall return to this feature in Subsection 1.2. Another commonly used 
pivotal selection method eliminates the variables $x_1, x_2, \ldots, x_{n-1}$ in 
succession by requiring that $(i_k, k)$ be the indices of the maximal coefficient 
of $x_k$ in the remaining system of $n - k + 1$ equations. (This method of 
maximal column pivots is particularly convenient for use on an electronic 
computer if the large matrix of coefficients is stored by columns since the 
search for a maximal column element is then quicker than the maximal 
matrix element search.) 
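
The maximal column pivot rule can be sketched in a few lines of Python. 
This is an illustration under stated assumptions, not a program from the 
text: the name `solve_with_column_pivots` is invented here, rows are 
physically interchanged instead of recording the pivot indices, and exact 
rational arithmetic (`fractions.Fraction`) is assumed so that rounding plays 
no role. 

```python
from fractions import Fraction


def solve_with_column_pivots(A, f):
    """Solve A x = f by elimination with maximal column pivots (sketch)."""
    n = len(A)
    # Augmented working copy; the caller's data are left untouched.
    M = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(A, f)]
    for k in range(n):
        # Row holding the largest |coefficient of x_k| among the remaining rows.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        if M[p][k] == 0:
            raise ValueError("matrix is singular")
        M[k], M[p] = M[p], M[k]            # bring the pivot row into place
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]          # multiplier, |m| <= 1 by the pivot choice
            for j in range(k, n + 1):
                M[i][j] -= m * M[k][j]
    # Back substitution on the resulting upper triangular system.
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


# The leading coefficient is zero, so the natural pivot order would fail here.
x = solve_with_column_pivots([[0, 2, 1], [1, 1, 1], [2, 1, 0]], [7, 6, 4])
```

A search over the whole remaining submatrix (maximal pivots) would differ 
only in the selection step and in the need to track column interchanges. 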

1.1. Operational Counts 

If the nth order matrix is non-singular, Gaussian elimination might be 
employed to solve the n special linear systems (0.6) and thus to obtain 
$A^{-1}$. Then to solve any number, say m, of systems with the same 
coefficient matrix A, we need only perform m multiplications of vectors by 
$A^{-1}$. However, we shall show here that for any value of m this procedure 
is less efficient than an appropriate application of Gaussian elimination to 
the m systems in question. In order to show this, we must count the number 
of arithmetic operations required in the procedures to be compared. 
The current convention is to count only multiplications and divisions. 
This custom arose because the first modern digital computers performed 
additions and subtractions much faster than they did multiplications and 
divisions which were done in comparable lengths of time. This variation 
in the execution time of the arithmetic operations is at present being 
reduced, but it should be noted that additions and subtractions are about 
as numerous as multiplications for most methods of this chapter. On 
the other hand, for some computers, as in the case of a desk calculator, it 
is possible to accumulate a sequence of multiplications (scalar product 
of two vectors) in the same time that it takes to perform the multiplications. 
Hence one is justified in neglecting to count these additions since they do 
not contribute to the total time of performing the calculation. 

Let us consider first the m systems, with arbitrary $f^{(j)}$, 

(7)   $Ax = f^{(j)}, \qquad j = 1, 2, \ldots, m.$ 

We assume, for the operational count, that the elimination proceeds in 
the natural order. The most efficient application then performs the 
operations in (3a) only once and those in (3b) once for each j; that is, m times. 
On digital computers, not all of the vectors $f^{(j)}$ may be available at the 
same time, and thus the calculations in (3b) may be done later than those 
in (3a). However, since the final reduced matrix $A^{(n)}$ is upper triangular, 
we may store the “multipliers” 

$$m_{i,k-1} \equiv \frac{a^{(k-1)}_{i,k-1}}{a^{(k-1)}_{k-1,k-1}}, \qquad 2 \le k \le i,$$

in the lower triangular part of the original matrix, A. (That is, $m_{i,k-1}$ is put 
in the location of $a^{(k-1)}_{i,k-1}$.) Thus, no operations in (3a) ever need to be 
repeated. 
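
A minimal sketch of this storage scheme follows, assuming the natural pivot 
order and exact rational entries. The function names `triangularize_in_place` 
and `solve_stored` are hypothetical, chosen here only for illustration. The 
first routine overwrites the strictly lower triangle with the multipliers; the 
second replays them on any right-hand side that arrives later. 

```python
from fractions import Fraction


def triangularize_in_place(A):
    """Overwrite A with U on and above the diagonal, multipliers below it."""
    n = len(A)
    for k in range(n - 1):
        for i in range(k + 1, n):
            A[i][k] = A[i][k] / A[k][k]      # multiplier kept where the zero would go
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return A


def solve_stored(A, f):
    """Solve for one right-hand side using only the stored factors."""
    n = len(A)
    g = list(f)
    for k in range(n - 1):                   # replay the row operations on f
        for i in range(k + 1, n):
            g[i] -= A[i][k] * g[k]
    x = [0] * n
    for i in range(n - 1, -1, -1):           # back substitution with U
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (g[i] - s) / A[i][i]
    return x


A = [[Fraction(v) for v in row] for row in [[2, 1, 1], [4, 3, 3], [8, 7, 9]]]
triangularize_in_place(A)                    # done once
x1 = solve_stored(A, [Fraction(4), Fraction(10), Fraction(24)])
x2 = solve_stored(A, [Fraction(1), Fraction(0), Fraction(0)])   # a later system
```

Only the second routine is repeated for each new right-hand side, which is 
the point of the remark about (3a). 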

From (3) and (4) we see that in eliminating $x_{k-1}$, a square submatrix 
of order $n - k + 1$ is determined and the last $n - k + 1$ components 
of each right-hand side are modified. Each element of the new submatrix 
and subvectors is obtained by performing a multiplication (and an addition 
which we ignore), but the quotients which appear as factors in (3) are 
computed only once. Thus, we find that it requires 

$(n - k + 1)^2 + (n - k + 1)$ ops. for (3a), 

$(n - k + 1)$ ops. for (3b). 

These operations must be done for $k = 2, 3, \ldots, n$ and hence with the 
aid of the formulae, 

$$\sum_{\nu=1}^{n} \nu = \frac{n(n+1)}{2}, \qquad \sum_{\nu=1}^{n} \nu^2 = \frac{n(n+1)(2n+1)}{6},$$

the total number of operations is found to be: 

(8a)   $\dfrac{n(n^2 - 1)}{3}$ ops. to triangularize A, 

(8b)   $\dfrac{n(n - 1)}{2}$ ops. to modify one inhomogeneous vector, $f^{(j)}$. 


To solve the resulting triangular system we use (6). Thus, to compute $x_i$ 
requires $(n - i)$ multiplications and one division. By summing this over 
$i = 1, 2, \ldots, n$, we get 

(8c)   $\dfrac{n(n + 1)}{2}$ ops. to solve one triangular system. 

Finally, to solve the m systems in (7) we must do the operations in (8b) 
and (8c) m times while those in (8a) are done only once. Thus, we have 

LEMMA 1. 

(9)   $\dfrac{n^3}{3} + mn^2 - \dfrac{n}{3}$ ops. 

are required to solve the m systems (7) by Gaussian elimination. ■ 

To compute $A^{-1}$, we could solve the n systems (0.6) so the above count 
would yield, upon setting m = n, 

$\dfrac{4n^3}{3} - \dfrac{n}{3}$ ops. to compute $A^{-1}$. 

However, the n inhomogeneous vectors $e^{(j)}$ are quite special, each having 
only one non-zero component which is unity. If we take account of this 
fact, the above operational count can be reduced. That is, for any fixed 
j, $1 \le j \le n$, the calculations to be counted in (3b) when $f = e^{(j)}$ start 
for $k = j + 2$. This follows, since $f_\nu^{(\nu)} = 0$ for $\nu = 1, 2, \ldots, j - 1$ and 
$f_j^{(j)} = 1$. Thus, if $j = n - 1$ or $j = n$, no multiplications are involved and 
in place of (8b), we have 

$$\sum_{k=j+2}^{n} (n - k + 1) = \sum_{\nu=1}^{n-j-1} \nu = \tfrac{1}{2}\,[\,j^2 - (2n - 1)j + n^2 - n\,] \ \text{ops.}$$

to modify the inhomogeneous vector $e^{(j)}$ for $j = 1, 2, \ldots, n - 2$. 

By summing this over the indicated values of j, we find 

$\tfrac{1}{6}(n^3 - 3n^2 + 2n)$ ops. to modify all $e^{(j)}$, $j = 1, 2, \ldots, n$. 

The count in (8c) is unchanged and thus to solve the n resulting triangular 
systems takes 

$\tfrac{1}{2}(n^3 + n^2)$ ops. 

Upon combining the above with (8a) we find 

LEMMA 2. It need only require 

(10)   $n^3$ ops. to compute $A^{-1}$. ■ 

Now let us find the operational count for solving the m systems in (7) 
by employing the inverse. Since $A^{-1}$ and $f^{(j)}$ need not have any zero 
or unit elements, it requires in general 

$n^2$ ops. to compute $A^{-1} f^{(j)}$. 

Thus, to solve m systems requires $mn^2$ ops. and if we include the $n^3$ 
operations to compute $A^{-1}$, we get the result: 

(11)   $n^3 + mn^2$ ops. 

are required to solve the m systems (7), when using the inverse matrix. 

Upon comparing (9) and (11) it follows that for any value of m the use 
of the inverse matrix is less efficient than using direct elimination. 
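
The comparison can be checked by tallying operations directly. The sketch 
below is an illustration only (the function names are invented here). It 
counts one operation per multiplication or division, following the 
convention of this subsection, and sets the tally beside the closed forms 
$n^3/3 + mn^2 - n/3$ for elimination and $n^3 + mn^2$ for the inverse. 

```python
def count_elimination_ops(n, m):
    """Tally multiplications and divisions for m right-hand sides."""
    ops = 0
    for k in range(2, n + 1):
        rows = n - k + 1
        ops += rows              # quotients (multipliers), computed once
        ops += rows * rows       # new submatrix of order n - k + 1
        ops += m * rows          # update of each right-hand side
    ops += m * (n * (n + 1) // 2)    # back substitution for each system
    return ops


def elimination_formula(n, m):
    # n^3/3 + m n^2 - n/3; n^3 - n is always divisible by 3.
    return (n**3 - n) // 3 + m * n**2


def inverse_formula(n, m):
    # n^3 to form the inverse, then n^2 per matrix-vector product.
    return n**3 + m * n**2


for n, m in [(3, 1), (10, 1), (10, 100)]:
    print(n, m, count_elimination_ops(n, m), inverse_formula(n, m))
```

The difference between the two closed forms is $2n^3/3 + n/3$, which does 
not depend on m. 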

1.2. A Priori Error Estimates; Condition Number 

In the course of actually carrying out the arithmetic required to solve 

(12) Ax = f 

by any procedure, roundoff errors will in general be introduced. But if the 
numerical procedure is “stable,” or if the problem is “well-posed” in 
the sense of Section 3 of Chapter 1, these errors can be kept within 
reasonable bounds. We shall investigate these matters for the Gaussian 
elimination method of solving (12). 

We recall that a computation is said to be well-posed if “small” changes 
in the data cause “small” changes in the solution. For the linear system 
(12) the data are A and f while x is the solution. The matrix A is said to be 
“well-conditioned” or “ill-conditioned” if the computation is or is not, 
respectively, well-posed. We shall make these notions somewhat more 
precise here and introduce a condition number for A which serves as a 
measure of ill-conditioning. Then we will show that the Gaussian 
elimination procedure yields accurate answers, even for very large order systems, 
if A is well-conditioned, and single precision arithmetic is used. 

Suppose first that the data A and f in (12) are perturbed by the quantities 
$\delta A$ and $\delta f$. Then if the perturbation in the solution x of (12) is $\delta x$ we have 

(13)   $(A + \delta A)(x + \delta x) = f + \delta f.$ 

Now an estimate of the relative change in the solution can be given in 
terms of the relative changes in A and f by means of 

THEOREM 3. Let A be non-singular and the perturbation $\delta A$ be so small 
that 

(14)   $\|\delta A\| < 1/\|A^{-1}\|.$ 

Then if x and $\delta x$ satisfy (12) and (13) we have 

(15)   $$\frac{\|\delta x\|}{\|x\|} \le \frac{\mu}{1 - \mu\,\|\delta A\|/\|A\|} \left( \frac{\|\delta f\|}{\|f\|} + \frac{\|\delta A\|}{\|A\|} \right),$$


where the condition number $\mu$ is defined as 

(16)   $\mu = \mu(A) \equiv \|A^{-1}\| \cdot \|A\|.$ 

Proof. Since $\|A^{-1}\delta A\| \le \|A^{-1}\| \cdot \|\delta A\| < 1$ by (14) it follows from the 
Corollary to Theorem 1.5 of Chapter 1 that the matrix $I + A^{-1}\delta A$ is 
non-singular, and further, that 

$$\|(I + A^{-1}\delta A)^{-1}\| \le \frac{1}{1 - \|A^{-1}\delta A\|} \le \frac{1}{1 - \|A^{-1}\| \cdot \|\delta A\|}.$$

If we multiply (13) by $A^{-1}$ on the left, recall (12) and solve for $\delta x$, we get 

$$\delta x = (I + A^{-1}\delta A)^{-1} A^{-1} (\delta f - \delta A\, x).$$

Now take norms of both sides, use the above bound on the inverse, and 
divide by $\|x\|$ to obtain 

$$\frac{\|\delta x\|}{\|x\|} \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\| \cdot \|\delta A\|} \left( \frac{\|\delta f\|}{\|x\|} + \|\delta A\| \right).$$

But from (12) it is clear that we may replace $\|x\|$ on the right, since 

$$\|x\| \ge \|f\| / \|A\|,$$

and (15) now easily follows by using the definition (16). ■ 

The estimate (15) shows that small relative changes in f and A cause 
small relative changes in the solution if the factor 

$$\frac{\mu}{1 - \mu\,\|\delta A\|/\|A\|}$$

is not too large. Of course the condition (14) is equivalent to 

$$\mu\, \frac{\|\delta A\|}{\|A\|} < 1.$$

Thus, it is clear that when the condition number $\mu(A)$ is not too large, 
the system (12) is well-conditioned. Note that we cannot expect $\mu(A)$ to be 
small compared to unity since 

$$\|I\| = \|A^{-1}A\| \le \mu(A).$$
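
To see the estimate of Theorem 3 at work, the sketch below evaluates both 
sides of the bound for one small example. Everything in it is an assumption 
made for illustration: the 2 by 2 data, the helper names, and the choice of 
the maximum norm (row-sum norm for matrices). Exact rational arithmetic is 
used so that the comparison is not blurred by rounding. 

```python
from fractions import Fraction as Fr


def norm_mat(M):
    """Maximum (row-sum) norm of a matrix."""
    return max(sum(abs(v) for v in row) for row in M)


def norm_vec(v):
    """Maximum norm of a vector."""
    return max(abs(c) for c in v)


def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_vec(M, v):
    return [sum(m * c for m, c in zip(row, v)) for row in M]


# Hypothetical data: a nearly singular matrix and small perturbations.
A = [[Fr(1), Fr(1)], [Fr(1), Fr(101, 100)]]
f = [Fr(2), Fr(201, 100)]
dA = [[Fr(0), Fr(0)], [Fr(0), Fr(1, 1000)]]
df = [Fr(0), Fr(-1, 1000)]

x = mat_vec(inverse_2x2(A), f)
A_pert = [[A[i][j] + dA[i][j] for j in range(2)] for i in range(2)]
f_pert = [f[i] + df[i] for i in range(2)]
x_pert = mat_vec(inverse_2x2(A_pert), f_pert)
dx = [x_pert[i] - x[i] for i in range(2)]

mu = norm_mat(inverse_2x2(A)) * norm_mat(A)           # condition number
rel_dA = norm_mat(dA) / norm_mat(A)
rel_df = norm_vec(df) / norm_vec(f)
assert norm_mat(dA) < 1 / norm_mat(inverse_2x2(A))    # smallness hypothesis

lhs = norm_vec(dx) / norm_vec(x)                      # relative change in x
rhs = mu / (1 - mu * rel_dA) * (rel_df + rel_dA)      # bound of Theorem 3
print(float(mu), float(lhs), float(rhs))
```

In this example relative data changes of about 0.05 percent move the 
solution by roughly 18 percent, which the bound permits because the 
condition number is about 400. 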


We can apply Theorem 3 to estimate the effects of roundoff errors 
committed in solving linear systems by Gaussian elimination and other 
direct methods. Given any non-singular matrix A, the condition number 
$\mu(A)$ is determined independently of the numerical procedure. But it is 
possible to view the computed solution as the exact solution, say $x + \delta x$, 
of a perturbed system of the form (13). The basic problem now is to 
determine the magnitude of the perturbations, $\delta A$ and $\delta f$. This type of approach 
is called a backward error analysis. It is rather clear that there are many 
perturbations $\delta A$ and $\delta f$ which yield the same solution, $x + \delta x$, in (13). 
Our analysis for the Gaussian elimination method will define $\delta A$ and $\delta f$ 
so that 

$$\delta f = 0.$$

Then the error estimate (15) becomes simply 

(17)   $$\frac{\|\delta x\|}{\|x\|} \le \frac{\mu\,\|\delta A\|/\|A\|}{1 - \mu\,\|\delta A\|/\|A\|},$$

and it is clear that only $\|\delta A\|$ in this error bound depends upon the 
roundoff errors and method of computation. 

In the case of Gaussian elimination we have seen in Theorem 1, that 
exact calculations yield the factorization (5b), 

$$LU = A.$$

Here L and U are, respectively, lower and upper triangular matrices 
determined by (5c) and (3a). However, with finite precision arithmetic in 
these evaluations, we do not obtain L and U exactly, but say some 
triangular matrices $\mathcal{L}$ and $\mathcal{U}$. We define the perturbation E due to these 
inexact calculations by 

(18)   $\mathcal{L}\mathcal{U} = A + E.$ 

There are additional rounding errors committed in computing g defined 
by (3b) or (5d), and in the final back substitution (6b) in attempting to 
compute the solution x. With exact calculations, these vectors are defined 
from (5d) and (6a) as the solutions of 

$$Lg = f, \qquad Ux = g.$$

The vectors actually obtained can be written as $g + \delta g$ and $x + \delta x$ 
which are the exact solutions of, say, 

(19a)   $(\mathcal{L} + \delta\mathcal{L})(g + \delta g) = f,$ 

(19b)   $(\mathcal{U} + \delta\mathcal{U})(x + \delta x) = g + \delta g.$ 

Here $\mathcal{L}$ and $\mathcal{U}$ account for the fact that the matrices L and U are not 
determined exactly, as in (18). The perturbations $\delta\mathcal{L}$ and $\delta\mathcal{U}$ arise from 
the finite precision arithmetic performed in solving the triangular systems 
with the coefficients $\mathcal{L}$ and $\mathcal{U}$. Upon multiplying (19b) by $\mathcal{L} + \delta\mathcal{L}$ and 
using (19a) we have, from (13) with $\delta f = 0$, 

$$A + \delta A = (\mathcal{L} + \delta\mathcal{L})(\mathcal{U} + \delta\mathcal{U}).$$

From (18), it follows then that 

(20)   $\delta A = E + \mathcal{L}(\delta\mathcal{U}) + (\delta\mathcal{L})\,\mathcal{U} + (\delta\mathcal{L})(\delta\mathcal{U}).$ 

Thus, to apply our error bound (17) we must estimate $\|E\|$, $\|\delta\mathcal{L}\|$, and 
$\|\delta\mathcal{U}\|$. Since $\mathcal{L}$ and $\mathcal{U}$ are explicitly determined by the computations, 
their norms can also, in principle, be obtained. 

We shall assume that floating-point arithmetic operations are performed 
with a t-digit decimal mantissa (see Section 2 of Chapter 1) and that the 
system has been ordered so that the natural order of pivots is used [i.e., as 
in eq. (3)]. In place of the matrices $A^{(k)} \equiv (a_{ij}^{(k)})$ defined in (2) and (3a), 
we shall consider the computed matrices $B^{(k)} \equiv (b_{ij}^{(k)})$ with the final such 
matrix $B^{(n)} \equiv \mathcal{U} = (u_{ij})$, the upper triangular matrix introduced above. 
Similarly, the lower triangular matrix of computed multipliers $\mathcal{L} \equiv (s_{ij})$ 
will replace the matrix $L \equiv (m_{ij})$ of (5c). For simplicity, we assume that the 
given matrix elements $A = (a_{ij})$ can be represented exactly with our 
floating-decimal numbers. 

Now in place of (3a) and (5c), the floating-point calculations yield 
$b_{ij}^{(k)}$ and $s_{ij}$ which by (2.6) of Chapter 1 can be written as: 

for k = 1, 

(21a)   $b_{ij}^{(1)} = a_{ij}, \qquad i, j = 1, 2, \ldots, n;$ 

for $k = 1, 2, \ldots, n - 1$, 

(21b)   $$b_{ij}^{(k+1)} = \begin{cases} b_{ij}^{(k)}, & i \le k, \\ 0, & i \ge k + 1,\ j \le k, \\ [\,b_{ij}^{(k)} - s_{ik} b_{kj}^{(k)} (1 + \theta_{ij}^{(k)} 10^{-t})\,](1 + \phi_{ij}^{(k)} 10^{-t}), & i \ge k + 1,\ j \ge k + 1; \end{cases}$$

and finally 

(21c)   $$s_{ij} = \begin{cases} 0, & i < j; \\ 1, & i = j; \\ \dfrac{b_{ij}^{(j)}}{b_{jj}^{(j)}}\,(1 + \delta_{ij} 10^{-t}), & i > j. \end{cases}$$

Here the quantities $\delta$, $\phi$, $\theta$ satisfy 

$$|\theta_{ij}^{(k)}| \le 5, \qquad |\phi_{ij}^{(k)}| \le 5, \qquad |\delta_{ij}| \le 5,$$

and they account for the rounding procedures in floating-point arithmetic. 
Of course, the above calculations can be carried out iff the $b_{jj}^{(j)} \ne 0$ for 
$j \le n - 1$. However, this can be assured from 

LEMMA 1. If A is non-singular and t sufficiently large, then the Gaussian 
elimination method, with maximal pivots and floating-point arithmetic 
(with t-digit mantissas), yields multipliers $s_{ij}$ with $|s_{ij}| \le 1$ and pivots 
$b_{jj}^{(j)} \ne 0$. 

Proof. See Problem 8. 






It turns out that we require a bound on the growth of the pivot elements 
for our error estimates. That is, we seek a quantity $G = G(n)$, independent 
of the $a_{ij}$, such that 

(22a)   $|b_{ij}^{(k)}| \le G(n)\,a, \qquad i, j, k = 1, 2, \ldots, n,$ 

where 

(22b)   $a \equiv \max_{i,j} |a_{ij}|.$ 

Under the conditions of Lemma 1, it is not difficult to see, by induction 
in (21b), that 

(22c)   $G(n) \le [\,1 + (1 + 5 \times 10^{-t})^2\,]^{n-1} = 2^{n-1} + O((n - 1)10^{1-t}).$ 

This establishes the existence of a bound of the form (22), but it is a 
tremendous overestimate for large n. In fact for exact elimination (i.e., 
no roundoff errors) using maximal pivots it can be shown that 

(23a)   $|a_{jj}^{(j)}| \le g(j)\,a,$ 

where 

(23b)   $g(j) \le 3\, j^{\,1/2 + (1/4)\ln j}.$ 

The quantity g(n) would be a reasonable estimate for G(n) if the maximal 
pivots in the sequence $\{B^{(k)}\}$ were located in the same positions as the 
maximal pivots in $\{A^{(k)}\}$. We know that if A is non-singular and t is 
sufficiently large, then the indices of the maximal pivotal elements used to find 
$\{B^{(k)}\}$ are also indices of maximal pivots in an exact Gaussian elimination 
procedure for A. For two special classes of matrices it is established in 
Problems 6 and 7 that $g(n) \le 1$ and $g(n) \le n$. The best (i.e., lowest) 
bound for G(n) [or for g(n)] is not known at present. 

We now turn to estimates of the terms in $\delta A$. Our first result is a bound 
on the elements of E which we state as 

THEOREM 4. Under the hypothesis of Lemma 1 the Gaussian elimination 
calculations (21) are such that 

$$\mathcal{L}\mathcal{U} = A + E$$

where $E \equiv (e_{ij})$ satisfies 

(24)   $$|e_{ij}| \le \begin{cases} (i - 1)\,2aG(n)10^{1-t}, & \text{for } i \le j; \\ j\,2aG(n)10^{1-t}, & \text{for } i > j. \end{cases}$$

Here G(n) is any bound satisfying (22). 

Proof. We write the last line of (21b) as 

(25a)   $b_{ij}^{(k+1)} = b_{ij}^{(k)} - s_{ik} b_{kj}^{(k)} + e_{ij}^{(k+1)}, \qquad i \ge k + 1,\ j \ge k + 1,$ 

where from Lemma 1 and (22) it follows that 

(25b)   $|e_{ij}^{(k+1)}| \le 2aG(n)10^{1-t}, \qquad i \ge k + 1,\ j \ge k + 1.$ 

Similarly, multiplying the last line of (21c) by $b_{jj}^{(j)}$ and dividing by 
$1 + \delta_{ij}10^{-t}$ we can write the result as 

(26a)   $0 = b_{ij}^{(j)} - s_{ij} b_{jj}^{(j)} + e_{ij}^{(j+1)}, \qquad i > j;$ 

where again we find that 

(26b)   $|e_{ij}^{(j+1)}| \le 2aG(n)10^{1-t}, \qquad i > j.$ 

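Upon recalling (21) we have 

$$\mathcal{L} \equiv (s_{ij}), \qquad \mathcal{U} \equiv (u_{ij}),$$

so that 

$$(\mathcal{L}\mathcal{U})_{ij} = \sum_{k=1}^{n} s_{ik} u_{kj} = \sum_{k=1}^{\min(i,j)} s_{ik} b_{kj}^{(k)}.$$

Now let i > j and sum (25a) over $k = 1, 2, \ldots, j - 1$ and then add (26a) 
to get, with the aid of (21), 

$$0 = b_{ij}^{(1)} - \sum_{k=1}^{j} s_{ik} b_{kj}^{(k)} + \sum_{k=1}^{j} e_{ij}^{(k+1)}, \qquad i > j.$$

From the last two equations above and the fact that 

$$b_{ij}^{(1)} = a_{ij},$$

we see that the elements $e_{ij}$ with i > j of $E \equiv \mathcal{L}\mathcal{U} - A$ are 

(27a)   $e_{ij} = \sum_{k=1}^{j} e_{ij}^{(k+1)}, \qquad i > j.$ 

For the elements with $i \le j$, we just sum (25a) over $k = 1, 2, \ldots, i - 1$, 
recalling that $s_{ii} = 1$, and obtain as before 

(27b)   $e_{ij} = \sum_{k=1}^{i-1} e_{ij}^{(k+1)}, \qquad i \le j.$ 

But now (24) follows from (27) by using the bounds (25b) and (26b). ■ 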

As a simple corollary of this theorem, we note that since 
$|e_{ij}| \le 2(n - 1)aG(n)10^{1-t}$, it follows that 

(28)   $\|E\|_\infty \le 2an(n - 1)G(n)10^{1-t}.$ 

The elements in $\delta\mathcal{L}$ and $\delta\mathcal{U}$ can be estimated from a single analysis of 
the error in solving any triangular system (with the same arithmetic and 
rounding). Thus we consider, say, 

(29a)   $Tu = h$ 

where $T \equiv (t_{ij})$ is lower triangular and non-singular, i.e., 

(29b)   $t_{ii} \ne 0; \qquad t_{ij} = 0,\ j > i; \qquad i = 1, 2, \ldots, n.$ 

The exact solution of (29) is easily obtained by recursion and is 

(30a)   $u_1 = \dfrac{h_1}{t_{11}},$ 

(30b)   $u_i = \dfrac{h_i - \sum_{k=1}^{i-1} t_{ik} u_k}{t_{ii}}, \qquad i = 2, 3, \ldots, n.$ 

For numerical solutions, we have 

THEOREM 5. Let the “solution” of (29) be computed by using t-digit 
floating-point arithmetic to evaluate (30). Then the computed solution, say v, 
satisfies 

(31a)   $(T + \delta T)v = h$ 

where the perturbations are bounded by 

(31b)   $|\delta t_{ij}| \le \max[\,2, |i - j + 1|\,]\,|t_{ij}|\,10^{1-t} \le n\,|t_{ij}|\,10^{1-t}.$ 

Here t is required to be so large that $n10^{1-t} \le 1$. 

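Proof. In the notation of Section 2 of Chapter 1 the floating-decimal 
evaluations, $v_i$, of the formulas (30) are 

$$v_1 = \mathrm{fl}\!\left(\frac{h_1}{t_{11}}\right),$$

$$v_i = \mathrm{fl}\!\left(\frac{h_i - \sum_{k=1}^{i-1} t_{ik} v_k}{t_{ii}}\right), \qquad i = 2, 3, \ldots, n.$$

Then by using (2.6c) of Chapter 1, we must have from the above 

(32a)   $v_1 = \dfrac{h_1}{t_{11}}\,(1 + \phi_1 10^{-t}), \qquad |\phi_1| \le 5;$ 

(32b)   $v_i = \dfrac{\mathrm{fl}\!\left(h_i - \sum_{k=1}^{i-1} t_{ik} v_k\right)}{t_{ii}}\,(1 + \phi_i 10^{-t}), \qquad |\phi_i| \le 5,\ i = 2, 3, \ldots, n.$ 

If, in the floating-point evaluation of the numerator in (32b) the sum is 
first accumulated and then subtracted from $h_i$ we can write, with the use 
of (2.6a) and Lemma 2.2 of Chapter 1, 

$$\mathrm{fl}\!\left(h_i - \sum_{k=1}^{i-1} t_{ik} v_k\right) = \left[\,h_i - \sum_{k=1}^{i-1} (t_{ik} + \delta t_{ik})\, v_k\right](1 + \theta_i 10^{-t}), \qquad i = 2, 3, \ldots, n.$$

Here $|\theta_i| \le 5$ and 

$$|\delta t_{ik}| \le \begin{cases} (i - k + 1)\,|t_{ik}|\,10^{1-t}, & 2 \le k \le i - 1, \\ (i - 1)\,|t_{i1}|\,10^{1-t}, & k = 1, \end{cases} \qquad i = 2, 3, \ldots, n.$$

From the above and (32) we now obtain, solving for the $h_i$, 

$$t_{11} v_1 (1 + \phi_1 10^{-t})^{-1} = h_1,$$

$$t_{ii} v_i (1 + \phi_i 10^{-t})^{-1}(1 + \theta_i 10^{-t})^{-1} + \sum_{k=1}^{i-1} (t_{ik} + \delta t_{ik})\, v_k = h_i, \qquad i = 2, 3, \ldots, n.$$

However, if we write 

$$t_{11} + \delta t_{11} = t_{11}(1 + \phi_1 10^{-t})^{-1},$$

$$t_{ii} + \delta t_{ii} = t_{ii}(1 + \phi_i 10^{-t})^{-1}(1 + \theta_i 10^{-t})^{-1},$$

then it follows from $|\phi_i| \le 5$, $|\theta_i| \le 5$, and $n10^{1-t} \le 1$ that 

$$|\delta t_{ii}| \le 2\,|t_{ii}|\,10^{1-t}, \qquad i \ge 1.$$ ■ 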

We are now able to obtain estimates of the elements in $\delta\mathcal{L}$ and $\delta\mathcal{U}$ 
or more importantly those in $\delta A$. These results are contained in the 
basic 

THEOREM 6 (WILKINSON). Let the nth order matrix A be non-singular and 
employ Gaussian elimination with maximal pivots and t-digit floating-point 
arithmetic to solve (12). Let t be so large that Lemma 1 applies and that 

$$n10^{1-t} \le 1.$$

Then the computed solution, say $x + \delta x$, satisfies 

$$(A + \delta A)(x + \delta x) = f$$

where 

(33)   $|\delta a_{ij}| \le (2n + 3n^2)\,G(n)\,a\,10^{1-t}$ 

and G(n) is any bound satisfying (22). 

Proof. We have already shown that $\delta A$ is given by (20) where the 
elements of E are estimated in (24). Since $\delta\mathcal{L}$ is the perturbation (19a) in 
solving the lower triangular system $\mathcal{L}g = f$, we can apply Theorem 5. 
We note that the elements of $\mathcal{L} \equiv (s_{ij})$ are given by (21c). Since maximal 
pivots are used $|b_{jj}^{(j)}| \ge |b_{ij}^{(j)}|$ so that $|s_{ij}| \le 1$ and we easily get from (31b) 
in this case 

$$|(\delta\mathcal{L})_{ij}| \equiv |\delta s_{ij}| \le n10^{1-t}.$$

The elements of $\delta\mathcal{U}$ are the perturbations in solving a system of the 
form $\mathcal{U}y = z$ with $\mathcal{U} \equiv (u_{ij}) = (b_{ij}^{(i)})$ where the $b_{ij}^{(i)}$ are defined by (21b). 
This system is, of course, upper triangular but the estimates of Theorem 5 
still apply. Since maximal pivots are used we have, recalling (22), 

$$|u_{ij}| = |b_{ij}^{(i)}| \le |b_{ii}^{(i)}| \le G(n)\,a, \qquad i \le j,\ i = 1, 2, \ldots, n;$$

and now (31b) yields 

$$|(\delta\mathcal{U})_{ij}| \equiv |\delta u_{ij}| \le nG(n)\,a\,10^{1-t}.$$

From (20) we have 

$$\delta a_{ij} = e_{ij} + \sum_{k=1}^{\min(i,j)} (s_{ik}\,\delta u_{kj} + \delta s_{ik}\,u_{kj} + \delta s_{ik}\,\delta u_{kj}).$$

By taking absolute values and using the above bounds on $|\delta u_{ij}|$, 
$|u_{ij}|$, $|\delta s_{ij}|$, as well as (24), we easily obtain 

$$|\delta a_{ij}| \le (2n + 2n^2 + n^3 10^{1-t})\,G(n)\,a\,10^{1-t}.$$

However, since it was required that $n10^{1-t} \le 1$, the result in (33) 
follows. ■ 

From this theorem, it follows that the computed solution is the exact 
solution of a system only slightly perturbed from the original if enough 
figures are used, i.e., t sufficiently large. Appropriate values for t depend 
upon n and the bound G(n). If, as indicated in (22c), G(n) were of the order 
$2^{n-1}$, only relatively small order systems could be treated effectively. 
On the other hand, if $G(n) \approx g(n) \le 3n^{1/2 + (1/4)\ln n}$, as in (23), then quite 
large order systems can be treated with the number of digits used on 
modern digital computers (say t > 8). It is generally believed, however, 
that even this latter estimate for G(n) is a generous overestimate, when 
using maximal pivots. It should be observed that essentially all of the 
previous analysis is valid if only partial pivoting, say maximal column 
pivoting, is employed since then $|s_{ij}| \le 1$ is maintained. However, the 
growth factor G(n) for this procedure cannot be estimated well in general. 
In fact, it is possible that the upper bound (22c) which still applies may be 
attained. In spite of this, partial pivoting is found to be effective in practice 
but the absence of any type of maximal pivoting strategy frequently leads to 
catastrophic growth of rounding errors.† 
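
The effect of omitting pivoting can be reproduced with t-digit decimal 
arithmetic. The sketch below is an illustration under stated assumptions: 
the function name `solve_t_digits` and the 2 by 2 test system are invented 
here, and Python's standard `decimal` module with its precision set to t is 
used as a stand-in for the floating-decimal arithmetic of the text, so that 
every sum, difference, product and quotient is rounded to t significant 
digits. 

```python
from decimal import Decimal, localcontext


def solve_t_digits(A, f, t, column_pivots):
    """Gaussian elimination with every operation rounded to t decimal digits."""
    n = len(A)
    with localcontext() as ctx:
        ctx.prec = t
        M = [[Decimal(v) for v in row] + [Decimal(b)] for row, b in zip(A, f)]
        for k in range(n - 1):
            if column_pivots:
                p = max(range(k, n), key=lambda i: abs(M[i][k]))
                M[k], M[p] = M[p], M[k]
            for i in range(k + 1, n):
                m = M[i][k] / M[k][k]                  # rounded multiplier
                for j in range(k + 1, n + 1):
                    M[i][j] = M[i][j] - m * M[k][j]    # two roundings per element
        x = [Decimal(0)] * n
        for i in range(n - 1, -1, -1):
            s = Decimal(0)
            for j in range(i + 1, n):
                s = s + M[i][j] * x[j]
            x[i] = (M[i][n] - s) / M[i][i]
    return x


# 0.0001 x1 + x2 = 1,  x1 + x2 = 2; both unknowns are close to 1.
A = [["0.0001", "1"], ["1", "1"]]
f = ["1", "2"]
without_pivots = solve_t_digits(A, f, 3, column_pivots=False)
with_pivots = solve_t_digits(A, f, 3, column_pivots=True)
```

With t = 3 the run in the natural order returns $x_1 = 0$, since the 
multiplier $10^4$ swamps the second equation, whereas the run with maximal 
column pivots returns (1, 1), which is correct to three digits. 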
From (33) we easily find that 

(34)   $\|\delta A\|_\infty \le (2n^2 + 3n^3)\,G(n)\,a\,10^{1-t},$ 

and this can be employed in (17) to obtain maximum norm bounds on 
the relative error. It is clear that this relative error in x may not be small 
even though the relative perturbation, $\|\delta A\|/\|A\|$, is small. In such a case, 
A would be ill-conditioned. By (17), the relative error $\|\delta x\|/\|x\|$ is small 
if $\|A^{-1}\| \cdot \|\delta A\|$ is small. For a given G(n) and $\mu(A)$ equation (34) may 
be used to find the value of t that assures a solution with a prescribed 
accuracy. 

Finally, we recall that a computed inverse of A can be obtained by 
solving the n systems (0.6). If we denote the matrix obtained as $A^{-1} + F$, 
then as above we can show that each column vector of $A^{-1} + F$, i.e., 
$(A^{-1} + F)_j$, satisfies an equation of the form, for some perturbation 
matrix $\delta A^{(j)}$, 

(35a)   $(A + \delta A^{(j)})(A^{-1} + F)_j = e^{(j)}, \qquad j = 1, 2, \ldots, n.$ 

Under the assumptions of Theorem 6, the estimates (33) and (34) also 
apply to the current perturbations, $\delta A^{(j)}$. Then, if $\|\delta A^{(j)}\| < 1/\|A^{-1}\|$ 
we have, almost as in the proof of Theorem 3, 

(35b)   $$\frac{\|F_j\|}{\|(A^{-1})_j\|} \le \frac{\mu(A)\,\|\delta A^{(j)}\|/\|A\|}{1 - \mu(A)\,\|\delta A^{(j)}\|/\|A\|}.$$

Thus, as was to be expected, the columns of the inverse matrix are obtained 
to within the same relative error [i.e., compare (17)] as is the solution of 
any particular system. 

1.3. A Posteriori Error Estimates 

Although we do not advocate inverting a matrix to solve linear systems, 
it is of interest to consider error estimates related to computed inverses. 


† Experience indicates that we usually achieve greater accuracy in the single precision 
solution, if we first scale the matrix A. That is, if with $B \equiv D_1 A D_2$, we solve 

$$By = D_1 f$$

for y; and then determine x from $D_2 y = x$. Here $D_1$ and $D_2$ are some diagonal 
matrices chosen so that the n columns and the n rows of the matrix B have 
approximately equal norms. A complete mathematical explanation of this phenomenon is 
not available. 

Let A be the matrix to be inverted and let C be the computed or alleged 
inverse. The error in the inverse is defined by 

(36a)   $F \equiv C - A^{-1};$ 

we also use another measure of error called the residual matrix: 

(36b)   $R \equiv AC - I.$ 

We have first 

THEOREM 7. If $\|R\| < 1$ then: 

(37a)   A and C are non-singular; 

(37b)   $\|A^{-1}\| \le \|C\|/(1 - \|R\|);$ 

(37c)   $\|F\| \le \|C\| \cdot \|R\|/(1 - \|R\|).$ 

Proof. We write (36b) as 

$$AC = I + R,$$

and use the corollary to Theorem 1.5 of Chapter 1 and $\|R\| < 1$ to deduce 
that AC is non-singular. Part (37a) then follows. Take the inverse of both 
sides in the above equation and multiply on the left by C to find 

$$A^{-1} = C(I + R)^{-1}.$$

Now (37b) follows by taking norms and by again using the corollary to 
Theorem 1.5 of Chapter 1. From (36) we see that $F = A^{-1}R$ and so, 
$\|F\| \le \|A^{-1}\| \cdot \|R\|$, and (c) follows by an application of (b). ■ 

Note that we may just as well consider A to be an approximation to the 
inverse of C. Thus we obtain the 

COROLLARY. Under the hypothesis of Theorem 7, 

(37d)   $\|C^{-1}\| \le \|A\|/(1 - \|R\|),$ 

(37e)   $\|A - C^{-1}\| \le \|A\| \cdot \|R\|/(1 - \|R\|).$ ■ 

Since A and C are presumed known, we could actually compute $\|C\|$, 
$\|A\|$, and $\|R\|$ in the estimates (37). This, of course, is what is meant by 
a posteriori estimates. In general, $n^3$ multiplications are required to form 
AC and this computation, as well as that of the norms, is subject to simply 
estimated roundoff errors. In contrast, the quantity $\delta A$ entering the a 
priori estimates (17) and (34) cannot be computed. It is hardly necessary 
to point out that G(n) is determined easily after the elimination process 
has been completed. 

It is of interest to note that, under the hypothesis of Theorem 7, with 
C an approximate inverse of A we can find the perturbation $\delta A$, so that C 
is the exact inverse of $A + \delta A$. That is, set 

$$(A + \delta A)C = I.$$

Hence, 

$$-\delta A = (AC - I)C^{-1} = RC^{-1}.$$

Upon taking norms and using (37d), we then have 

(38)   $$\|\delta A\| \le \frac{\|R\| \cdot \|A\|}{1 - \|R\|}.$$



Finally, we observe that the computed inverse can also be used to 
estimate the error in solving a linear system. We state this result as 

THEOREM 8. Let an approximate solution y of 

$$Ax = f$$

have the residual vector 

(39)   $r \equiv Ay - f.$ 

Then, if an approximate inverse C of A satisfies $\|R\| \equiv \|AC - I\| < 1$, 
we have 

(40)   $$\|y - x\| \le \frac{\|r\| \cdot \|C\|}{1 - \|R\|}.$$

Proof. From Theorem 7 it follows that A is non-singular and so from 
(39) 

$$y = A^{-1}(r + f).$$

Subtract $x = A^{-1}f$ from this to find, after taking norms, 

$$\|y - x\| \le \|A^{-1}\| \cdot \|r\|.$$

The result (40) then follows from (37b). ■ 


The determination of the residual vector r is the first step in an iterative 
procedure to improve upon the accuracy of the solution (see Subsection 
4.3). 

It should be noted that the result in this theorem is independent of the 
manner in which the approximate solution, y, is obtained. Thus, it was 
not assumed that $y = Cf$. This suggests, in fact, that the sole purpose for 
computing C and R might be to use them in error estimates of the form 
(40). That is, once the constant $M \equiv \|C\|/(1 - \|R\|)$ is known, it requires 
only $n^2$ multiplications to compute r for each approximate solution y 
of a system with coefficient matrix A. If one wished to use (40), after 
finding y by Gaussian elimination, then an approximate inverse C could 
be obtained by using the approximate factorization of A. This, by (8a) 
and (10), would require twice as much labor as was already expended to 
find y. 
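
The a posteriori estimate of Theorem 8 uses only computable quantities, 
which the following sketch makes concrete. The 2 by 2 data, the deliberately 
inexact inverse C, the approximate solution y and the helper names are all 
assumptions made for illustration; the maximum norm is used throughout and 
the arithmetic is exact so that only the estimate itself is on display. 

```python
from fractions import Fraction as Fr


def norm_mat(M):
    """Maximum (row-sum) norm of a matrix."""
    return max(sum(abs(v) for v in row) for row in M)


def norm_vec(v):
    """Maximum norm of a vector."""
    return max(abs(c) for c in v)


def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def mat_vec(M, v):
    return [sum(m * c for m, c in zip(row, v)) for row in M]


A = [[Fr(4), Fr(1)], [Fr(2), Fr(3)]]
C = [[Fr(3, 10), Fr(-1, 10)], [Fr(-2, 10), Fr(41, 100)]]   # slightly wrong inverse
f = [Fr(5), Fr(5)]                                          # exact solution is (1, 1)
y = [Fr(101, 100), Fr(98, 100)]                             # some approximate solution

AC = mat_mul(A, C)
R = [[AC[i][j] - (1 if i == j else 0) for j in range(2)] for i in range(2)]  # residual matrix
r = [a - b for a, b in zip(mat_vec(A, y), f)]                               # residual vector

assert norm_mat(R) < 1                          # hypothesis of the theorem
M_const = norm_mat(C) / (1 - norm_mat(R))       # the constant M, computed once
bound = norm_vec(r) * M_const                   # computable error bound
error = norm_vec([yi - 1 for yi in y])          # true error, known here by construction
print(float(error), float(bound))
```

Once `M_const` is available, each further approximate solution costs only 
the matrix-vector product needed for its residual. 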


PROBLEMS, SECTION 1 


1. Show that $A^{(k)}$ is non-singular iff A is non-singular, for the Gaussian 
elimination method. 

2. Describe how the maximal pivot scheme permits the completion of the 
elimination method, when A is singular. 

3. Prove the following corollary to Theorem 2: If interchanges of rows and 
of columns are made and r = n, then 

$$\det A = (-1)^J a_{11}^{(1)} a_{22}^{(2)} \cdots a_{nn}^{(n)},$$

where $a_{kk}^{(k)}$ are the successive pivotal elements in the Gaussian elimination 
scheme and $(-1)^J = \det P \det Q$. 

4. If A is symmetric and positive definite (that is, $x^*Ax \ge 0$ and 
$\sum_{i,j=1}^{n} a_{ij}x_ix_j = 0$ only if $x_i = 0$ for all i; $a_{ij}$ and $x_i$ real), show that 

(a) $a_{ii} > 0$ 

(b) $\max_i a_{ii} = \max_{i,j} |a_{ij}|$. [Hint: For (b), if $|a_{rs}| = \max |a_{ij}|$, then with 
$x_i = 0$ for $i \ne r, s$ 

$$\sum_{i,j=1}^{n} a_{ij}x_ix_j = a_{rr}x_r^2 + 2a_{rs}x_rx_s + a_{ss}x_s^2 = 0$$

for non-trivial $x_r, x_s$ if $a_{rr}a_{ss} \le a_{rs}^2$.] 

5. If A is symmetric, positive definite, then the submatrices $(a_{ij}^{(k)})$ for 
$k \le i, j \le n$ are symmetric, positive definite. [Hint: Use mathematical 
induction on k. Symmetry from 

$$a_{ij}^{(2)} = a_{ij}^{(1)} - \frac{a_{i1}^{(1)}}{a_{11}^{(1)}}\, a_{1j}^{(1)}.$$

Positive definiteness from Problem (4a) and 

$$\sum_{i,j=2}^{n} a_{ij}^{(2)} x_i x_j = \sum_{i,j=1}^{n} a_{ij}^{(1)} x_i x_j - a_{11}^{(1)} \left[\, x_1 + \sum_{j=2}^{n} \frac{a_{1j}^{(1)}}{a_{11}^{(1)}}\, x_j \right]^2.$$

That is, if $(a_{ij}^{(2)})$ is not positive definite, then $(a_{ij}^{(1)})$ is not.] 

6. (Von Neumann-Goldstine.) If A is symmetric, positive definite, then 
$a_{ii}^{(k)} \le a_{ii}^{(k-1)}$ for $k \le i \le n$, $k = 2, 3, \ldots, n$. (Hence by Problem (4b), 
$\max |a_{ij}^{(k)}| \le \max |a_{ij}|$.) 

7. (Wilkinson.) If A is a Hessenberg matrix (i.e., $a_{ij} = 0$ for $i \ge j + 2$), 

$$\max |a_{ij}^{(k)}| \le n \max |a_{ij}|,$$

if maximal column pivots are used. [Hint: Only one row is changed in passing 
from $A^{(k-1)}$ to $A^{(k)}$.] 

8. Prove Lemma 1. 

9. If A is symmetric, use the first part of Problem 5 to show: the number of 
operations to solve (1) by Gaussian elimination, with diagonal pivots, is 
$n^3/6 + O(n^2)$. 

10. If in (1), A and f are complex, show that (1) may be converted to the 
solution of a real system of order 2n. 


2. VARIANTS OF GAUSSIAN ELIMINATION 


There are many methods for solving linear systems that are slight 
variations of the Gaussian elimination method. None of these methods has 
succeeded in reducing the number of operations required, but some have 
eliminated much of the intermediate storage or recording requirements. 
Caution should be taken in applying any variation that does not allow 
for the selection of some sort of maximal pivots, which is generally 
necessary to prevent the growth of rounding errors. 

The modification due to Jordan circumvents the final back substitution. 
This is accomplished by additional computations which serve to eliminate 
the variable $x_k$ from the first k − 1 equations as well as from the last n − k 
equations at the kth stage of the reduction. In other words, the coefficients 
above the diagonal are also reduced to zero and the final coefficient matrix 
which results is a diagonal matrix. The obvious modifications which are 
required for this Gauss-Jordan elimination are contained in 

(1a)   $a_{ij}^{(1)} = a_{ij}$ 

(1b)   $a_{ij}^{(k)} = a_{ij}^{(k-1)} - \dfrac{a_{i,k-1}^{(k-1)}}{a_{k-1,k-1}^{(k-1)}}\, a_{k-1,j}^{(k-1)}$   for $i \ne k - 1$, $j \ge k - 1$ 

(1c)   $a_{k-1,j}^{(k)} = a_{k-1,j}^{(k-1)}$   for $j \ge k - 1$ 

(1d)   $a_{ij}^{(k)} = a_{ij}^{(k-1)}$   for $j < k - 1$. 

(2a)   $f_i^{(1)} = f_i$ 

(2b)   $f_i^{(k)} = f_i^{(k-1)} - \dfrac{a_{i,k-1}^{(k-1)}}{a_{k-1,k-1}^{(k-1)}}\, f_{k-1}^{(k-1)}$   for $i \ne k - 1$ 

(2c)   $f_{k-1}^{(k)} = f_{k-1}^{(k-1)}$   for $k = 2, \ldots, n + 1$. 

The solution is then 

$$x_i = \frac{f_i^{(n+1)}}{a_{ii}^{(n+1)}}, \qquad i = 1, 2, \ldots, n.$$


It is clear that pivoting on the maximal element in the remaining square 
submatrix may be retained in this procedure. Hence, multipliers for 
$i < k - 1$ may exceed unity. Furthermore, the number of operations is 
somewhat greater than in the ordinary Gaussian elimination with back 
substitution; it is now 

(3)   $\dfrac{n^3}{2} + n^2 - \dfrac{n}{2}$ ops. 

Thus, there does not seem to be any great advantage in using the 
Gauss-Jordan elimination in actual calculations with automatic computing 
equipment. 
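
A short sketch of the Gauss-Jordan reduction follows, written under the 
assumptions that the natural pivot order succeeds and that exact rational 
arithmetic is used. The function name `gauss_jordan` and the operation 
counter are additions made here for illustration. The counter tallies one 
operation per multiplication or division, so it can be set beside the count 
$n^3/2 + n^2 - n/2$. 

```python
from fractions import Fraction


def gauss_jordan(A, f):
    """Reduce A x = f to diagonal form; return (x, number of mult/div ops)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(A, f)]
    ops = 0
    for p in range(n):                      # pivot row p; all rows but p change
        if M[p][p] == 0:
            raise ZeroDivisionError("zero pivot; reorder the system first")
        for i in range(n):
            if i == p:
                continue                    # the pivot row itself is unchanged
            m = M[i][p] / M[p][p]
            ops += 1
            M[i][p] = Fraction(0)           # eliminated entry, set without arithmetic
            for j in range(p + 1, n + 1):   # remaining columns and the right-hand side
                M[i][j] -= m * M[p][j]
                ops += 1
    x = []
    for i in range(n):                      # one division per unknown, no back substitution
        x.append(M[i][n] / M[i][i])
        ops += 1
    return x, ops


x, ops = gauss_jordan([[2, 1, 1], [4, 3, 3], [8, 7, 9]], [4, 10, 24])
```

For n = 3 the tally is 21, against 17 for ordinary elimination with back 
substitution on a single right-hand side. 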

Another variation is the so-called Crout reduction. This method is 
applicable if the rows and columns are so arranged that no column 
interchanges are required in the Gaussian elimination (as in the case of 
symmetric, positive definite matrices; see Theorem 3.3). Thus, in general, the 
pivots will not be the maximal elements. Hence, errors may grow very 
rapidly in the Crout method and it is not recommended unless the system 
is of relatively small order or if it can be determined that the error growth 
will not be catastrophic. (In practice one may apply the method and test 
the accuracy of the solution a posteriori.) On the other hand, the Crout 
method is specifically designed to reduce the number of intermediate 
quantities which must be retained. Thus, for hand computations and 
digital computers with small storage capacities it may be of great value. 
The Crout method may be modified to use maximal column pivots, by 
incorporating row interchanges as described in Theorem 3.1 (or see 
Theorem 1.2). 

This “compact” elimination procedure is based on the fact that only 
those elements a[)\ in the Gaussian elimination, for which j > i and 
/' < k , are required for the final back substitution. 

Thus, we seek a recursive method of defining the columns of L (lower 
triangular matrix of multipliers) and rows of U (upper triangular matrix). 
From Theorem 1.1, we know that 

LU = A. 

Hence, let us write the formula for $a_{kj}$ from the rule for matrix
multiplication, after a simple algebraic transformation, in the form

(4)    $u_{kj} = a_{kj} - \sum_{p=1}^{k-1} m_{kp} u_{pj}$,    if $k \le j$;

and the formula for $a_{ik}$ in the form

(5)    $m_{ik} = \dfrac{1}{u_{kk}} \left( a_{ik} - \sum_{p=1}^{k-1} m_{ip} u_{pk} \right)$,    if $i > k$.

We now may use (4) and (5) for $k \ge 2$ to find first the elements of the
$k$th row of $U$ and then the elements of the $k$th column of $L$, provided
that we know the previous rows and columns, respectively, of $U$ and $L$.
Hence, we need only define the first row

(4)$_1$    $u_{1j} = a_{1j}$    for $1 \le j \le n$

and first column

(5)$_1$    $m_{i1} = \dfrac{a_{i1}}{a_{11}}$    for $2 \le i \le n$.

If we define $u_{1,n+1} = f_1$; $a_{i,n+1} = f_i$ and use (4) for $j = n + 1$, we find
a column $(u_{i,n+1})$ which is the vector $\mathbf{g}$ of Theorem 1.1.

We then use the back substitution to solve $U\mathbf{x} = \mathbf{g}$ as before [where $U$
represents the first $n$ columns of $(u_{ij})$]. The operational count, for producing
$L$, $U$, and $\mathbf{g}$, is easily found to be $(2n^3 + 3n^2 - 5n)/6$. It is not surprising
that this is the same as the number of operations required by the
conventional Gaussian elimination scheme to produce $L$, $U$, and $\mathbf{g}$ (since we
merely avoid writing down the intermediate elements but have ultimately
to do the same multiplications and divisions).
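The recursions (4) and (5) can be sketched in a few lines of Python. This is only an illustrative sketch: the names `crout_compact` and `back_substitute` are hypothetical, no row interchanges are performed (so a zero pivot simply raises an error), and ordinary double precision is used throughout.

```python
def crout_compact(a, f):
    """Compact elimination without interchanges.

    Returns (m, upper, g): the unit lower triangular matrix of multipliers,
    the upper triangular matrix U, and the vector g with L g = f.
    """
    n = len(a)
    # Treat f as column n+1 of A, so that (4) with j = n+1 produces g.
    aug = [a[i][:] + [f[i]] for i in range(n)]
    u = [[0.0] * (n + 1) for _ in range(n)]
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        # (4): k-th row of U, including the extra column that holds g_k.
        for j in range(k, n + 1):
            u[k][j] = aug[k][j] - sum(m[k][p] * u[p][j] for p in range(k))
        if u[k][k] == 0.0:
            raise ZeroDivisionError("zero pivot: row interchanges would be needed")
        # (5): k-th column of L.
        for i in range(k + 1, n):
            s = sum(m[i][p] * u[p][k] for p in range(k))
            m[i][k] = (aug[i][k] - s) / u[k][k]
    g = [u[i][n] for i in range(n)]
    upper = [row[:n] for row in u]
    return m, upper, g


def back_substitute(upper, g):
    """Solve U x = g for upper triangular U."""
    n = len(g)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(upper[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (g[i] - s) / upper[i][i]
    return x
```

Each element of $U$ and $L$ is written once, which is the storage saving the text attributes to the method; the inner sums are the inner products that could be accumulated in higher precision.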

We could show now, if the inner products in (4) and (5) are accumulated
in double precision before the sum is rounded, that the effect of rounding
errors is appreciably diminished. In fact, the estimate in place of (1.34)
becomes

    $\|\delta A\|_\infty = \mathcal{O}(n^2 G(n) a 10^{-t})$.

PROBLEM, SECTION 2

1. Verify the operational count for the Crout method: $(2n^3 + 3n^2 - 5n)/6$.


3. DIRECT FACTORIZATION METHODS 

The final forms, (2.4) and (2.5), that are used in the Crout method, 
suggest a more general study of the direct triangular decomposition 

(1) LU = A, 

in which the diagonal elements of $L$ are not necessarily unity. In fact,
if we consider $L \equiv (l_{ij})$ then (1) implies

(2a)    $l_{kk} u_{kk} = a_{kk} - \sum_{p=1}^{k-1} l_{kp} u_{pk}$,    for $k \ge 2$;

(2b)    $u_{kj} = \dfrac{1}{l_{kk}} \left( a_{kj} - \sum_{p=1}^{k-1} l_{kp} u_{pj} \right)$,    for $j > k \ge 2$;

(2c)    $l_{ik} = \dfrac{1}{u_{kk}} \left( a_{ik} - \sum_{p=1}^{k-1} l_{ip} u_{pk} \right)$,    for $i > k \ge 2$.

[Equations (2a, b, c) hold for $k = 1$, if we remove the $\sum$ term.] Equation
(2a) determines the product $l_{kk}u_{kk}$ in terms of data in previous rows of $U$
and columns of $L$. Once $l_{kk}$ and $u_{kk}$ are chosen to satisfy (2a), we then
use (2b) and (2c) to determine the remaining elements in the $k$th row
and column.
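As an illustration of how (2a), (2b), and (2c) are used once a rule for splitting the product $l_{kk}u_{kk}$ has been fixed, here is a minimal Python sketch. The function names (`direct_lu`, `unit_lower`, `unit_upper`, `product`) are hypothetical, and no row interchanges are attempted.

```python
def direct_lu(a, choose_diag):
    """Triangular decomposition LU = A by (2a)-(2c).

    choose_diag(d) must return a pair (l_kk, u_kk) with l_kk * u_kk == d.
    """
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    up = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # (2a): the product l_kk * u_kk is fixed by earlier rows and columns.
        d = a[k][k] - sum(low[k][p] * up[p][k] for p in range(k))
        if d == 0.0:
            raise ZeroDivisionError("l_kk * u_kk = 0: interchanges would be needed")
        low[k][k], up[k][k] = choose_diag(d)
        # (2b): remaining elements of the k-th row of U.
        for j in range(k + 1, n):
            s = sum(low[k][p] * up[p][j] for p in range(k))
            up[k][j] = (a[k][j] - s) / low[k][k]
        # (2c): remaining elements of the k-th column of L.
        for i in range(k + 1, n):
            s = sum(low[i][p] * up[p][k] for p in range(k))
            low[i][k] = (a[i][k] - s) / up[k][k]
    return low, up


def unit_lower(d):
    """Diagonal of L equal to one."""
    return 1.0, d


def unit_upper(d):
    """Diagonal of U equal to one."""
    return d, 1.0


def product(low, up):
    """Plain matrix product, used only to check LU = A."""
    n = len(low)
    return [[sum(low[i][p] * up[p][j] for p in range(n)) for j in range(n)]
            for i in range(n)]
```

Passing `unit_lower` or `unit_upper` gives two different factor pairs of the same matrix; only the way the diagonal product is divided between $L$ and $U$ changes.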

If $l_{kk}u_{kk} = 0$, the factorization is not possible, unless all of the brackets
vanish in (2b), for $j > k$, or all of the brackets vanish in (2c), for $i > k$.
If $A$ is non-singular, then the use of maximal column pivots results in the
sequence $(i_1, 1), (i_2, 2), \ldots, (i_n, n)$ as pivotal elements. Hence, the Gaussian
elimination process shows that the triangular decomposition

    $LU = P^T A$,

is possible, where $P$ is defined in Theorem 1.2. In fact, if $A$ is non-singular,
one of the bracketed expressions in (2c) does not vanish, for some $i \ge k$.
Therefore, one of the bracketed expressions in (2c) is of maximum
absolute value for $i \ge k$, say for $i = i_k \ge k$. We may then move the elements
of the row $i_k$ in both $A$ and in the part of $L$ that has already been found
up to row $k$. (The rows $k, k + 1, \ldots, i_k - 1$ are moved down in both
$L$ and $A$ to fill the gap.) Hence, if $A$ is non-singular we may, with row
interchanges, employ (2a, b, and c) to achieve a triangular factorization.
We summarize these facts in

THEOREM 1. If $A$ is non-singular, a triangular decomposition, $LU = A$,
may not be possible. But a permutation of the rows of $A$ can be found such
that $B \equiv P^T A = LU$, where $P \equiv (p_{rs})$ and


In fact, the $P$ may be found so that

    $|l_{kk}| \ge |l_{ik}|$    for $i > k$; $k = 1, 2, \ldots, n - 1$. ■

Note that in this result, in contrast to that in Theorem 1.2, we have only 
employed row interchanges. 

A symmetric choice $l_{kk} = u_{kk}$ may lead to imaginary numbers, if the
right-hand side of (2a) is negative; a less symmetric choice $l_{kk} = |u_{kk}|$
keeps the arithmetic real if $A$ is real (see Problem 1).

As in the Crout method, we may consider $\mathbf{f}$ as an additional column
of $A$ (i.e., $a_{i,n+1} \equiv f_i$) and use (2b) for $j = n + 1$ to define the elements
$g_i \equiv u_{i,n+1}$ such that

    $U\mathbf{x} = L^{-1}\mathbf{f} \equiv \mathbf{g}$.

In Subsections 3.1, 3.2, and 3.3, we consider special applications of this 
procedure. 

3.1. Symmetric Matrices (Cholesky Method) 


We begin with 

THEOREM 2. Let $A$ be symmetric. If the factorization $LU = A$ is possible,
then the choice $l_{kk} = u_{kk}$ implies $l_{ik} = u_{ki}$, that is, $LL^T = A$.

Proof. Use (2) and induction on $k$. ■

A simple, non-singular, symmetric matrix for which the factorization is
not possible is

    $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$.

On the other hand, if the symmetric matrix $A$ is positive definite (i.e.,
$\mathbf{x}^*A\mathbf{x} > 0$ if $\mathbf{x}^*\mathbf{x} > 0$), then the factorization is possible. We have

THEOREM 3. Let $A$ be symmetric, positive definite. Then, $A$ can be factored
in the form

    $LL^T = A$.

Proof. Problems 4 and 5 of Section 1 show that the Gaussian elimination
method can be carried out, without any interchanges, to give the
factorization $(m_{ij})(b_{ij}) = A$, where $b_{ii} > 0$. But if we define

    $l_{kk} = u_{kk} = \sqrt{b_{kk}}$,

then by Problem 1, we will obtain from (2b, c) the elements in the
factorization

    $LU = A$,

where

    $l_{ik} = u_{ki}$. ■


A count of the arithmetic operations can be made if we remark that only
the elements $l_{ik}$ defined by (2a, c) are involved. If we count the square
root operations separately, we have

    $\dfrac{n^3}{6} + n^2 - \dfrac{n}{6}$ ops. $+\ n$ square roots = no. of ops. to find $L$ and $\mathbf{g}$.

In addition, to find $\mathbf{x}$ we must solve a triangular system which requires
$(n^2 + n)/2$ operations. Thus, to solve one system using the Cholesky method
requires $n^3/6 + 3n^2/2 + n/3$ operations plus $n$ square roots.
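A minimal Python sketch of the Cholesky method follows, using (2a) and (2c) with the symmetric choice $l_{kk} = u_{kk}$. The function names are hypothetical, and the check `d <= 0` is a simple stand-in for "the matrix is not positive definite"; it is an assumption of this sketch rather than part of the text.

```python
import math


def cholesky(a):
    """Return lower triangular L with L L^T = A for symmetric positive definite A."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # (2a) with l_kk = u_kk: l_kk^2 = a_kk - sum of squares in row k.
        d = a[k][k] - sum(low[k][p] ** 2 for p in range(k))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        low[k][k] = math.sqrt(d)
        # (2c) with u_pk = l_kp.
        for i in range(k + 1, n):
            s = sum(low[i][p] * low[k][p] for p in range(k))
            low[i][k] = (a[i][k] - s) / low[k][k]
    return low


def cholesky_solve(a, f):
    """Solve A x = f by L g = f followed by L^T x = g."""
    low = cholesky(a)
    n = len(f)
    g = [0.0] * n
    for i in range(n):
        g[i] = (f[i] - sum(low[i][p] * g[p] for p in range(i))) / low[i][i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(low[p][i] * x[p] for p in range(i + 1, n))
        x[i] = (g[i] - s) / low[i][i]
    return x
```

Only the lower triangle of $L$ is computed and stored, which is where the roughly halved operation count relative to the general factorization comes from.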

To apply our previous error analysis, we deduce from

    $a_{ii} = \sum_{k=1}^{i} l_{ik} u_{ki} = \sum_{k=1}^{i} l_{ik}^2$

the bounds

    $|l_{ik}|^2 \le a_{ii} \le a \equiv \max_i |a_{ii}|$.

For single precision square roots and $G(n) = 1$, we could prove (as we do
in Theorem 1.6)

THEOREM 4. If $A$ is symmetric and positive definite, then the approximate
solution of $A\mathbf{x} = \mathbf{f}$ obtained by factorization and floating-point arithmetic
with $t$ digits satisfies

    $(A + \delta A)\mathbf{y} = \mathbf{f}$,

where for t sufficiently large 

||M||oo < aW-\2n 2 4- In 3 ). ■ 

COROLLARY. Under the hypothesis of Theorem 4, if inner products are
accumulated exactly, prior to a final rounding, then for $t$ sufficiently large

    $\|\delta A\|_\infty = \mathcal{O}(n^2 a 10^{-t})$. ■

3.2. Tridiagonal or Jacobi Matrices 

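A coefficient matrix which frequently occurs is the tridiagonal or Jacobi
form, in which $a_{ij} = 0$ if $|i - j| > 1$. That is,

(3)    $A \equiv \begin{pmatrix} a_1 & c_1 & & & \\ b_2 & a_2 & c_2 & & \\ & \ddots & \ddots & \ddots & \\ & & b_{n-1} & a_{n-1} & c_{n-1} \\ & & & b_n & a_n \end{pmatrix}$.

Assume this matrix can be factored in the bidiagonal form

    $A = LU \equiv \begin{pmatrix} \alpha_1 & & & \\ b_2 & \alpha_2 & & \\ & \ddots & \ddots & \\ & & b_n & \alpha_n \end{pmatrix} \begin{pmatrix} 1 & \gamma_1 & & \\ & 1 & \ddots & \\ & & \ddots & \gamma_{n-1} \\ & & & 1 \end{pmatrix}$.

Then we find,

(4a)    $\alpha_1 = a_1$,    $\gamma_1 = c_1/\alpha_1$;

(4b)    $\alpha_i = a_i - b_i\gamma_{i-1}$,    $i = 2, 3, \ldots, n$;

(4c)    $\gamma_i = c_i/\alpha_i$,    $i = 2, 3, \ldots, n - 1$.

Thus, if none of the $\alpha_i$ vanish, the factorization is accomplished by
evaluating the recursions in (4). The “intermediate” solution $\mathbf{g}$ of $L\mathbf{g} = \mathbf{f}$
becomes

(5a)    $g_1 = f_1/\alpha_1$;

(5b)    $g_i = (f_i - b_i g_{i-1})/\alpha_i$,    $i = 2, \ldots, n$;

and the final solution $\mathbf{x}$ of $U\mathbf{x} = \mathbf{g}$ is given by

(6a)    $x_n = g_n$,

(6b)    $x_j = g_j - \gamma_j x_{j+1}$,    $j = n - 1, n - 2, \ldots, 1$.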
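The recursions (4), (5), and (6) translate almost line for line into code. The following Python sketch is illustrative only; the name `solve_tridiagonal` and the convention that `b[0]` and `c[n-1]` are unused placeholders are assumptions made here for convenience.

```python
def solve_tridiagonal(b, a, c, f):
    """Solve a tridiagonal system by the bidiagonal factorization (4)-(6).

    a: diagonal a_1..a_n; b: sub-diagonal (b[0] is ignored);
    c: super-diagonal (c[n-1] is ignored); f: right-hand side.
    """
    n = len(a)
    alpha = [0.0] * n
    gamma = [0.0] * n
    g = [0.0] * n
    for i in range(n):
        # (4a), (4b): alpha_i = a_i - b_i * gamma_{i-1}.
        alpha[i] = a[i] if i == 0 else a[i] - b[i] * gamma[i - 1]
        if alpha[i] == 0.0:
            raise ZeroDivisionError("alpha_i vanished; factorization (4) fails")
        # (4a), (4c): gamma_i = c_i / alpha_i, needed only for i < n.
        if i < n - 1:
            gamma[i] = c[i] / alpha[i]
        # (5a), (5b): forward solve L g = f.
        g[i] = (f[i] if i == 0 else f[i] - b[i] * g[i - 1]) / alpha[i]
    # (6a), (6b): back solve U x = g.
    x = [0.0] * n
    x[n - 1] = g[n - 1]
    for j in range(n - 2, -1, -1):
        x[j] = g[j] - gamma[j] * x[j + 1]
    return x
```

The factorization and the forward solve are merged into one loop here; keeping `alpha` and `gamma` and repeating only the `g` and `x` recursions would serve several right-hand sides.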

In many of the applications of this procedure, the elements (3) of $A$
satisfy

(7a)    $|a_1| > |c_1| > 0$;

(7b)    $|a_i| \ge |b_i| + |c_i|$,    $b_i c_i \ne 0$,    $i = 2, 3, \ldots, n - 1$;

(7c)    $|a_n| > |b_n| > 0$.

In such cases, the quantities $\alpha_i$ and $\gamma_i$ can be shown to be nicely bounded
and in fact $A$ is non-singular. We state this as

THEOREM 5. If the elements of $A$ in (3) satisfy (7) then $\det A \ne 0$ and the
quantities in (4) are bounded by

    (a) $|\gamma_i| < 1$;    (b) $|a_i| - |b_i| < |\alpha_i| < |a_i| + |b_i|$.

Proof. From (4a) and (7a), it follows that $|\gamma_1| < 1$. Assume $|\gamma_i| < 1$
for $i = 1, 2, \ldots, j - 1$. Then by (4b, c)

    $\gamma_j = \dfrac{c_j}{a_j - b_j\gamma_{j-1}}$

and thus

    $|\gamma_j| \le \dfrac{|c_j|}{|a_j| - |b_j||\gamma_{j-1}|} < \dfrac{|c_j|}{|a_j| - |b_j|}$

by the inductive assumption. Finally, by using (7b, c) in the above, it
follows that $|\gamma_j| < 1$ and hence (a) is proved. Using this result and (4b)
it follows that

    $|a_i| + |b_i| > |\alpha_i| > |a_i| - |b_i|$  $(\ge |c_i|)$,

which concludes the proof of the inequalities (a) and (b). But then

    $\det A = (\det L) \cdot (\det U) = \prod_{j=1}^{n} \alpha_j \ne 0$. ■

It should be noted that, when the conditions (7) hold, the procedure
defined in (4) must be valid. Further, if $b_i c_i = 0$ for some $i \ne 1, n$, then the
system can be reduced to two systems which are essentially uncoupled.
Similarly, if $c_1 = 0$ or $b_n = 0$ then $x_1$ or $x_n$, respectively, can be eliminated
to get a reduced system.

The operational count for this procedure is somewhat striking:

    (4) requires $2(n - 1)$ ops.

    (5) requires $1 + 2(n - 1)$ ops.

    (6) requires $n - 1$ ops.

or a total of

(8)    $5n - 4$ ops.

to solve a single system. If there are $m$ such systems to be solved, the
quantities $\alpha_i$, $\gamma_i$ in (4) need be computed only once and (5) and (6) are then
each done $m$ times for a total of

(9)    $(3n - 2)m + 2n - 2$ ops.

to solve $m$ systems. Consequently, the inverse can be obtained, although
it should never be used in such circumstances to solve the system, in not
more than

    $3n^2 - 2$ ops.

The low operational counts in (8) and (9) are due to the fact that the zero
elements of $A$ have been accounted for in performing the calculations.

It should be observed that the factorization computed in (4) is not unique.
Thus, for instance, we could try the form

    $\begin{pmatrix} 1 & & & \\ \beta_2 & 1 & & \\ & \ddots & \ddots & \\ & & \beta_n & 1 \end{pmatrix} \begin{pmatrix} \alpha_1 & c_1 & & \\ & \alpha_2 & \ddots & \\ & & \ddots & c_{n-1} \\ & & & \alpha_n \end{pmatrix}$.

The reader should derive the recursions analogous to (4)-(6) for this case
and prove the corresponding version of Theorem 5. We give the reader
leave to develop a treatment and operational count for the general band
matrix. A matrix $(c_{ij})$ of order $n$ is called a band matrix of width $(b, a)$ iff

    $c_{ij} = 0$    for $j - i > a$ or $i - j > b$.

3.3. Block-Tridiagonal Matrices 

Another form which is encountered frequently, especially in the numerical
solution of partial differential equations and integral equations, is the
so-called block-tridiagonal matrix

(10)    $A \equiv \begin{pmatrix} A_1 & C_1 & & & \\ B_2 & A_2 & C_2 & & \\ & \ddots & \ddots & \ddots & \\ & & B_i & A_i & C_i \\ & & & \ddots & \ddots \\ & & & B_n & A_n \end{pmatrix}$.

Here, each of the $A_i$ represents a square matrix, of order $m_i$, and each
of the $B_i$ and $C_i$ are rectangular matrices which just “fit” the indicated
pattern. That is, $B_i$ must have $m_i$ rows and $m_{i-1}$ columns, and $C_i$ must
have $m_i$ rows and $m_{i+1}$ columns. Note that if all $m_i = m$, then all
the submatrices are square and of order $m$. The order of the matrix $A$ is
$\sum_{i=1}^{n} m_i$, or again if all $m_i = m$ then the order is $(mn)$.

A system with coefficient matrix of the form (10) may be solved by a
procedure formally analogous to the previous factorization of a Jacobi
matrix. Thus, let the system be

(11)    $A\mathbf{x} = \mathbf{f}$

where now

(12)    $\mathbf{x} \equiv \begin{pmatrix} \mathbf{x}^{(1)} \\ \mathbf{x}^{(2)} \\ \vdots \\ \mathbf{x}^{(n)} \end{pmatrix}$,    $\mathbf{f} \equiv \begin{pmatrix} \mathbf{f}^{(1)} \\ \mathbf{f}^{(2)} \\ \vdots \\ \mathbf{f}^{(n)} \end{pmatrix}$,

and each $\mathbf{x}^{(i)}$ and $\mathbf{f}^{(i)}$ are $m_i$-component column vectors. That is, the
components of the vector $\mathbf{x}$ are grouped into subsets, $\mathbf{x}^{(i)}$, and these subsets
are to be “eliminated,” as in the Gaussian procedure, a group at a time.
Thus, the method to be described is a special case of more general methods
known as group- or block-elimination.

Exactly as in Subsection 3.2 we seek a factorization of the form

(13)    $A = LU \equiv \begin{pmatrix} \bar{A}_1 & & & \\ B_2 & \bar{A}_2 & & \\ & \ddots & \ddots & \\ & & B_n & \bar{A}_n \end{pmatrix} \begin{pmatrix} I_1 & \Gamma_1 & & \\ & I_2 & \ddots & \\ & & \ddots & \Gamma_{n-1} \\ & & & I_n \end{pmatrix}$,

where the $I_j$ are identity matrices of order $m_j$, the $\bar{A}_j$ are square matrices of
order $m_j$, and the $\Gamma_j$ are rectangular matrices with $m_j$ rows and $m_{j+1}$
columns. Proceeding formally, we find that

(14a)    $\bar{A}_1 = A_1$,    $\Gamma_1 = \bar{A}_1^{-1} C_1$;

(14b)    $\bar{A}_i = A_i - B_i\Gamma_{i-1}$,    $i = 2, 3, \ldots, n$;

(14c)    $\Gamma_i = \bar{A}_i^{-1} C_i$,    $i = 2, 3, \ldots, n - 1$.

From the definitions of the matrices involved, it is clear that each $\Gamma_i$ is
rectangular of the indicated order and that the product $B_i\Gamma_{i-1}$ and hence
$\bar{A}_i$ is square of order $m_i$. The system (11) is now equivalent to

(15)    $L\mathbf{y} = \mathbf{f}$,    $U\mathbf{x} = \mathbf{y}$

where $\mathbf{y}$ also has the compound form indicated in (12). We thus obtain
formally, from (13) in (15),

(16)    $\mathbf{y}^{(1)} = \bar{A}_1^{-1}\mathbf{f}^{(1)}$,
        $\mathbf{y}^{(i)} = \bar{A}_i^{-1}(\mathbf{f}^{(i)} - B_i\mathbf{y}^{(i-1)})$,    $i = 2, 3, \ldots, n$,

and

(17)    $\mathbf{x}^{(n)} = \mathbf{y}^{(n)}$,
        $\mathbf{x}^{(i)} = \mathbf{y}^{(i)} - \Gamma_i\mathbf{x}^{(i+1)}$,    $i = n - 1, n - 2, \ldots, 1$.

This method requires (or rather seems to require) the inversion of the
$n$ matrices $\bar{A}_i$ and the formation of the $2(n - 1)$ matrix products $\bar{A}_i^{-1}C_i$,
$B_i\Gamma_{i-1}$. To estimate the total number of operations used, we consider the
cases where all $m_i = m$. Then with Gaussian elimination to obtain the
inverses, we require from (1.10) (see discussion below on improving
efficiency by not computing inverses explicitly),

    $nm^3$ ops. for all $\bar{A}_i^{-1}$.

The product of two square matrices of order $m$ requires $m^3$ operations,
hence, we have

    $2(n - 1)m^3$ ops. for all $\bar{A}_i^{-1}C_i$ and $B_i\Gamma_{i-1}$.

Thus, the evaluation of (14) involves not more than

(18)    $(3n - 2)m^3$ ops.

The evaluation of (16) and (17) involves only products of $m$-component
vectors by square matrices and we find

    (16) requires $(2n - 1)m^2$ ops.;

    (17) requires $(n - 1)m^2$ ops.

The total for (14), (16), and (17) is thus

(19)    $(3n - 2)(m^3 + m^2)$ ops.,

to solve the system (11) with coefficient matrix (10).

Notice that this number is much superior to estimates of the form
$\tfrac{1}{3}(nm)^3$ which are appropriate for direct elimination methods applied to
arbitrary systems of order $(nm)$. In fact, if $n = m$ the block-elimination
scheme requires about $3m^4$ operations, while from (1.9) straightforward
Gaussian elimination uses on the order of $\tfrac{1}{3}m^6$ operations. The great
gain in economy of operations is again due to the careful account taken
of the large number of zero elements in $A$. In fact, even greater efficiency
is attained if each $\Gamma_i$ is computed by solving the $m$ linear systems, $\bar{A}_i\Gamma_i = C_i$,
and not by computing $\bar{A}_i^{-1}$; and if similarly (15) is solved for $\mathbf{y}^{(i)}$.
The count in (19) is then reduced from the order of $3nm^3$ operations
to the order of $\tfrac{7}{3}nm^3$ operations when we do not compute inverses.
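The variant that avoids explicit inverses can be sketched as follows in plain Python. Everything here is illustrative: the helper names (`solve_dense`, `matmul`, `matsub`, `solve_block_tridiagonal`) are hypothetical, blocks are stored as lists of rows, and the dense solver uses row interchanges inside each diagonal block, which is an assumption of this sketch and not something the text prescribes.

```python
def solve_dense(a, rhs):
    """Solve a X = rhs (rhs given as a matrix) by elimination with row pivoting."""
    m, k = len(a), len(rhs[0])
    w = [a[i][:] + rhs[i][:] for i in range(m)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(w[r][col]))
        if w[piv][col] == 0.0:
            raise ZeroDivisionError("singular diagonal block")
        w[col], w[piv] = w[piv], w[col]
        for r in range(col + 1, m):
            t = w[r][col] / w[col][col]
            for j in range(col, m + k):
                w[r][j] -= t * w[col][j]
    x = [[0.0] * k for _ in range(m)]
    for i in range(m - 1, -1, -1):
        for j in range(k):
            s = sum(w[i][p] * x[p][j] for p in range(i + 1, m))
            x[i][j] = (w[i][m + j] - s) / w[i][i]
    return x


def matmul(p, q):
    return [[sum(p[i][t] * q[t][j] for t in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def matsub(p, q):
    return [[p[i][j] - q[i][j] for j in range(len(p[0]))] for i in range(len(p))]


def solve_block_tridiagonal(B, A, C, F):
    """Block elimination (14), (16), (17); B[0] and C[n-1] are unused.

    F[i] and the returned blocks are column vectors stored as m_i x 1 matrices.
    """
    n = len(A)
    gamma = [None] * n
    y = [None] * n
    for i in range(n):
        if i == 0:
            abar, rhs = A[0], F[0]                              # (14a)
        else:
            abar = matsub(A[i], matmul(B[i], gamma[i - 1]))     # (14b)
            rhs = matsub(F[i], matmul(B[i], y[i - 1]))          # (16)
        if i < n - 1:
            # One elimination of abar serves both Gamma_i and y^(i).
            cols = len(C[i][0])
            sol = solve_dense(abar, [C[i][r] + rhs[r] for r in range(len(abar))])
            gamma[i] = [row[:cols] for row in sol]
            y[i] = [row[cols:] for row in sol]
        else:
            y[i] = solve_dense(abar, rhs)
    x = [None] * n
    x[n - 1] = y[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = matsub(y[i], matmul(gamma[i], x[i + 1]))         # (17)
    return x
```

Appending the right-hand side block to $C_i$ before eliminating means each diagonal block is triangularized only once, which is the saving described in the preceding paragraph.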

The justification of the block-factorization method is given in

THEOREM 6. If the leading diagonal submatrices

    $\begin{pmatrix} A_1 & C_1 & & \\ B_2 & A_2 & \ddots & \\ & \ddots & \ddots & C_{k-1} \\ & & B_k & A_k \end{pmatrix}$,    $k = 1, 2, \ldots, n$,

of the original matrix (10) are non-singular, then the block-factorization
in (14) may be carried out (i.e., the $\bar{A}_i$ are non-singular).

Proof. This is left to Problem 2.


PROBLEMS, SECTION 3 

1. If $LU = A$ is a factorization of $A$ satisfying (2), show that $l_{ik}u_{kj}$ is
independent of the choice of $l_{kk}$ and $u_{kk}$ that satisfy (2a).

2. Prove Theorem 6. 

4. ITERATIVE METHODS 

The previous direct methods for solving general systems of order $n$
require about $n^3/3$ operations. In addition, it has been indicated that, in
practical computations with these methods, the errors which are necessarily
introduced through rounding may become quite large for large $n$.
Now we consider iterative methods in which an approximate solution is
sought by using fewer operations per iteration. In general, these may be
described as methods which proceed from some initial “guess,” $\mathbf{x}^{(0)}$, and
define a sequence of successive approximations $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots$ which, in
principle, converge to the exact solution. If the convergence is sufficiently
rapid, the procedure may be terminated at an early stage in the sequence
and will yield a good approximation. One of the intrinsic advantages of
such methods is the fact that errors, due to roundoff or even blunders,
may be damped out as the procedure continues. In fact, special iterative
methods are frequently used to improve “solutions” obtained by direct
methods (see Subsection 4.3).

A large class of iterative methods may be defined as follows: Let the
system to be solved be

(1)    $A\mathbf{x} = \mathbf{f}$

where $\det |A| \ne 0$. Then the coefficient matrix can be split, in an infinite
number of ways, into the form

(2)    $A = N - P$

where $N$ and $P$ are matrices of the same order as $A$. The system (1) is then
written as

(3)    $N\mathbf{x} = P\mathbf{x} + \mathbf{f}$.

Starting with some arbitrary vector $\mathbf{x}^{(0)}$, we define a sequence of vectors
$\{\mathbf{x}^{(\nu)}\}$, by the recursion

(4)    $N\mathbf{x}^{(\nu)} = P\mathbf{x}^{(\nu-1)} + \mathbf{f}$,    $\nu = 1, 2, \ldots$.

It is now clear that one of the restrictions to be placed on the splitting
(2) is that

(5)    $\det N \ne 0$,

in which case the recursions (4) define a unique sequence of vectors for
all $\mathbf{x}^{(0)}$ and $\mathbf{f}$. As a practical matter, it is also clear that $N$ should be chosen
such that a system of the form

(6)    $N\mathbf{y} = \mathbf{z}$

can be “easily” solved. Furthermore, if greater accuracy is desired, it
would be better to calculate with (4) in the equivalent form

    $N(\mathbf{x}^{(\nu)} - \mathbf{x}^{(\nu-1)}) = \mathbf{f} - A\mathbf{x}^{(\nu-1)}$.

This point will be discussed further in Subsection 4.3.

The convergence of the sequence $\{\mathbf{x}^{(\nu)}\}$ to the solution $\mathbf{x}$ of (1) is studied
by introducing the matrix

(7)    $M \equiv N^{-1}P$

and the error vectors

(8)    $\mathbf{e}^{(\nu)} \equiv \mathbf{x}^{(\nu)} - \mathbf{x}$,    $\nu = 0, 1, 2, \ldots$.

Subtract (3) from (4) to obtain, upon multiplication by $N^{-1}$,

(9)    $\mathbf{e}^{(\nu)} = M\mathbf{e}^{(\nu-1)} = M^2\mathbf{e}^{(\nu-2)} = \cdots = M^{\nu}\mathbf{e}^{(0)}$,    $\nu = 1, 2, \ldots$,

where $\mathbf{e}^{(0)}$ is the arbitrary initial error. Thus, it is clear that a sufficient
condition for convergence, i.e., that $\lim_{\nu \to \infty} \mathbf{e}^{(\nu)} = \mathbf{o}$, is that $\lim_{\nu \to \infty} M^{\nu} = O$,
and this is also necessary if the method is to converge for all $\mathbf{e}^{(0)}$.

A matrix, $M$, that satisfies this condition is called a convergent matrix.
The basic results characterizing convergent matrices have been established
in Chapter 1, Theorem 1.4 and its Corollary, which we restate here
as

THEOREM 1. The matrix $M$ is convergent, i.e.,

    $\lim_{\nu \to \infty} M^{\nu} = O$,

iff all eigenvalues of $M$ are less than one in absolute value. ■

(This condition is frequently stated as $\rho(M) < 1$ where $\rho(M)$ is the
spectral radius of $M$ defined by

    $\rho(M) \equiv \max_i |\lambda_i|$

where the $\lambda_i$ are the eigenvalues of $M$.)

COROLLARY 1. The matrix $M$ is convergent if for any matrix norm,
$\|M\| < 1$. ■

It is, in general, difficult to verify the conditions of Theorem 1. However,
Corollary 1 may frequently be used to show that $\rho(M) < 1$. We have
(see Chapter 1, Section 1, for the notation)

COROLLARY 2. The matrix $M \equiv (m_{ij})$ is convergent if either

(10a)    $\|M\|_\infty \equiv \max_i \sum_{j=1}^{n} |m_{ij}| < 1$;

or

(10b)    $\|M\|_1 \equiv \max_j \sum_{i=1}^{n} |m_{ij}| < 1$.

Proof. We have shown in Chapter 1, Section 1, that $\|M\|_\infty$ and $\|M\|_1$
are matrix norms. ■
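The two tests of Corollary 2 are cheap to evaluate. The short Python sketch below (function names hypothetical) computes the maximum row sum and the maximum column sum and reports whether either test succeeds. Since the tests are only sufficient conditions, a `False` result does not show that the matrix fails to be convergent.

```python
def norm_inf(m):
    """Maximum absolute row sum, as in (10a)."""
    return max(sum(abs(v) for v in row) for row in m)


def norm_one(m):
    """Maximum absolute column sum, as in (10b)."""
    return max(sum(abs(m[i][j]) for i in range(len(m))) for j in range(len(m[0])))


def passes_norm_test(m):
    """True if (10a) or (10b) holds, which guarantees that m is convergent."""
    return min(norm_inf(m), norm_one(m)) < 1.0


# A matrix for which the row-sum test fails but the column-sum test succeeds.
example = [[0.5, 0.6],
           [0.0, 0.3]]
```

For `example` the row sums are 1.1 and 0.3 while the column sums are 0.5 and 0.9, so only (10b) certifies convergence; this is why both tests are worth trying.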

Let us return to the iteration scheme (4)-(9) and assume it to be a
convergent one. We introduce the notion of the rate of convergence, $R$,
of the iterative scheme by setting

(11)    $R \equiv -\log \rho(M)$.

The significance of this quantity is most easily seen if we recall
the corollary to Theorem 1.3 of Chapter 1, which states that:
$\rho(M) = \underset{\{\|\cdot\|\}}{\mathrm{g.l.b.}}\ \|M\|$ (here $\{\|\cdot\|\}$ is the set of natural norms). Given the initial
error, $\mathbf{e}^{(0)}$, (9) permits the estimate in terms of any natural norm

    $\|\mathbf{e}^{(\nu)}\| \le \|M\|^{\nu}\|\mathbf{e}^{(0)}\|$.

Then for a given $\epsilon > 0$, there is some norm such that

(12)    $\|\mathbf{e}^{(\nu)}\| \le [\rho(M) + \epsilon]^{\nu}\|\mathbf{e}^{(0)}\|$.

On the other hand, again from (9), if $\mathbf{e}^{(0)}$ is an eigenvector of $M$
corresponding to the largest eigenvalue, $\|\mathbf{e}^{(\nu)}\| = [\rho(M)]^{\nu}\|\mathbf{e}^{(0)}\|$. Let it be required to
reduce the amplitude of the error by a factor of at least $10^{-m}$, $m > 0$.
From (12), we see that, in some norm, the error amplitude is reduced by
a factor close to $[\rho(M)]^{\nu}$. The number of iterations required is the least
value of $\nu$ for which

    $[\rho(M)]^{\nu} \le 10^{-m}$.

By taking logs and recalling that $0 < \rho(M) < 1$, we obtain

(13)    $\nu \ge \dfrac{m}{-\log \rho(M)} = \dfrac{m}{R}$.

Thus, the number of iterations required to reduce the initial error by the
factor $10^{-m}$ is inversely proportional to $R$, the rate of convergence.
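As a small worked illustration of (11) and (13), the following Python sketch computes the rate of convergence and the least number of iterations for a given spectral radius. The function names are hypothetical, and the logarithm is taken to base 10, as the factor $10^{-m}$ suggests.

```python
import math


def convergence_rate(rho):
    """R = -log10(rho) for a spectral radius 0 < rho < 1, as in (11)."""
    if not 0.0 < rho < 1.0:
        raise ValueError("the spectral radius must lie strictly between 0 and 1")
    return -math.log10(rho)


def iterations_needed(rho, m):
    """Least integer v satisfying v >= m / R, as in (13)."""
    return math.ceil(m / convergence_rate(rho))
```

For instance, with $\rho(M) = 0.5$ and $m = 6$ the ratio $m/R$ is a little under 20, so twenty iterations are needed, while $\rho(M) = 0.9$ requires well over a hundred for the same reduction.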

4.1. Jacobi or Simultaneous Iterations 

A special case (attributed to Jacobi) of the previous general theory is

(14)    $N \equiv (a_{ii}\delta_{ij})$,    $P \equiv N - A = (a_{ii}\delta_{ij} - a_{ij})$.

From (14) in (4), it is seen that the components $x_i^{(\nu)}$ of the $\nu$th iterate are
simply computed with $\mathbf{x}^{(0)}$ arbitrary by

(15)    $x_i^{(\nu)} = \dfrac{1}{a_{ii}} \left( f_i - \sum_{j=1,\, j \ne i}^{n} a_{ij} x_j^{(\nu-1)} \right)$,    $i = 1, 2, \ldots, n$;  $\nu = 1, 2, \ldots$.

Thus, this procedure may be employed provided only that $a_{ii} \ne 0$ for all
$i = 1, 2, \ldots, n$. However, for the convergence of these iterations, Theorem
1 requires that all roots of $\det |\lambda I - N^{-1}P| = 0$ satisfy $|\lambda| < 1$. This
equation can be written as, assuming $\det |N| \ne 0$,

(16)    $\det |\lambda N - P| = \det \begin{pmatrix} \lambda a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & \lambda a_{22} & & \vdots \\ \vdots & & \ddots & a_{n-1,n} \\ a_{n1} & a_{n2} & \cdots & \lambda a_{nn} \end{pmatrix} = 0$.

In general, the roots of such an equation, for large $n$, are not easily
obtained and so we seek simpler sufficient conditions for convergence, as
given in Corollary 2. The relevant matrix $M$ is easily obtained since
$N^{-1} = (\delta_{ij}/a_{ii})$ and thus

(17)    $M = N^{-1}P = \left( \delta_{ij} - \dfrac{a_{ij}}{a_{ii}} \right)$.

Now conditions (10a) and (10b) of Corollary 2 become

(18a)    $\|M\|_\infty = \max_i \sum_{j=1,\, j \ne i}^{n} \left| \dfrac{a_{ij}}{a_{ii}} \right| < 1$,

(18b)    $\|M\|_1 = \max_j \sum_{i=1,\, i \ne j}^{n} \left| \dfrac{a_{ij}}{a_{ii}} \right| < 1$.

These tests are easily applied in practice. Since $\rho(M) \le \|M\|$ we obtain a
lower bound on the rate of convergence

(19)    $R = \log \dfrac{1}{\rho(M)} \ge \log \dfrac{1}{\min(\|M\|_1, \|M\|_\infty)}$.

The operational count for the Jacobi iteration is simply obtained from
(15); it is

(20)    $n^2$ ops. per iteration.

Thus by (13), if these iterations converge they require a total of about

    $\dfrac{m}{R} \times n^2$ ops.,

to reduce the initial error by at least $10^{-m}$. We see that if such an iterative
method is to be at least as efficient as the direct elimination method it
should have a rate of convergence and required accuracy factor, say $10^{-m}$,
satisfying

    $\dfrac{m}{R} \le \dfrac{n}{3}$.

(We assume here that $m$ has been so chosen that the iterative solution will
have an accuracy comparable to the accuracy obtained by the direct
elimination method using the same number of digits in the arithmetic.)
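A minimal Python sketch of the Jacobi iteration and of the row-sum test follows. The function names are hypothetical and a fixed number of sweeps is used instead of a stopping criterion, purely to keep the sketch short.

```python
def jacobi(a, f, x0, sweeps):
    """Simultaneous iteration: every component of the new iterate is computed
    from the previous iterate only."""
    n = len(a)
    x = x0[:]
    for _ in range(sweeps):
        x = [(f[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
             for i in range(n)]
    return x


def jacobi_row_sum(a):
    """Maximum over i of the off-diagonal row sums of |a_ij / a_ii|."""
    n = len(a)
    return max(sum(abs(a[i][j] / a[i][i]) for j in range(n) if j != i)
               for i in range(n))


# A diagonally dominant example whose exact solution is (1, 2, 3).
a_example = [[4.0, 1.0, 1.0],
             [1.0, 5.0, 2.0],
             [1.0, 2.0, 6.0]]
f_example = [9.0, 17.0, 23.0]
```

For `a_example` the row-sum quantity is about 0.6, so the maximum-norm error shrinks by at least that factor in every sweep; building the new list from the old one is what makes the iteration "simultaneous".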

4.2. Gauss-Seidel or Successive Iterations 

It is clear from (15) that in the ordinary Jacobi iterations some
components of $\mathbf{x}^{(\nu)}$ are known, but not used, while computing the remaining
components. The Gauss-Seidel method is a modification of the Jacobi
method in which all of the latest known components are used. The term
“successive” which is frequently applied to this method refers to the
fact that “new” components are successively used as they are obtained.
(In contrast, the previous scheme was called “simultaneous” since new
components were not employed as found, i.e., the “new” components
were introduced simultaneously at the end of the iterative cycle.)

The obvious modification of (15) suggested by the above remarks is,
with $\mathbf{x}^{(0)}$ arbitrary,

(21)    $x_i^{(\nu)} = \dfrac{1}{a_{ii}} \left( f_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(\nu)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(\nu-1)} \right)$,
        $i = 1, 2, \ldots, n$;  $\nu = 1, 2, \ldots$.
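In code, the only change from a simultaneous sweep is that the iterate is overwritten in place, so that components already updated in the current sweep are used at once. A minimal Python sketch (the name `gauss_seidel` is hypothetical; a fixed number of sweeps is assumed):

```python
def gauss_seidel(a, f, x0, sweeps):
    """Successive iteration: x is updated in place, so for j < i the sum uses
    the new components and for j > i it uses those of the previous sweep."""
    n = len(a)
    x = x0[:]
    for _ in range(sweeps):
        for i in range(n):
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (f[i] - s) / a[i][i]
    return x
```

Only one copy of the iterate needs to be stored, in contrast to the simultaneous scheme, which must keep the old iterate until the sweep is complete.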

The splitting of $A$ that yields this iterative scheme is

(22)    $N \equiv \begin{pmatrix} a_{11} & & & \\ a_{21} & a_{22} & & \\ \vdots & & \ddots & \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}$,    $P \equiv N - A$.

Since $N$ is triangular, $\det |N| \ne 0$ is assured again if $a_{ii} \ne 0$; $i = 1, 2, \ldots, n$.
The characteristic equation, whose roots must be in absolute value
less than unity for convergence, is now of the form

(23)    $\det \begin{pmatrix} \lambda a_{11} & a_{12} & \cdots & a_{1n} \\ \lambda a_{21} & \lambda a_{22} & & \vdots \\ \vdots & & \ddots & a_{n-1,n} \\ \lambda a_{n1} & \lambda a_{n2} & \cdots & \lambda a_{nn} \end{pmatrix} = 0$.

The roots of this equation are just as difficult to find as are those of
(16), but the sufficient conditions of Corollary 2 are now much more
complicated than those for the Jacobi iteration. However, a simple sufficient
condition for convergence of the Gauss-Seidel method can be obtained.
To derive this condition, we introduce the error vectors defined in (8)
and find from (1) and (21) that the components of these vectors must
satisfy

(24)    $e_i^{(\nu)} = -\sum_{j=1}^{i-1} \dfrac{a_{ij}}{a_{ii}} e_j^{(\nu)} - \sum_{j=i+1}^{n} \dfrac{a_{ij}}{a_{ii}} e_j^{(\nu-1)}$,
        $i = 1, 2, \ldots, n$;  $\nu = 1, 2, \ldots$.

The result to be proved may now be stated as

LEMMA 1. Let the vectors $\mathbf{e}^{(\nu)}$, $\nu = 1, 2, \ldots$, be defined by (24) with $\mathbf{e}^{(0)}$
arbitrary. Define the maximum norm, $\|\cdot\|_\infty$, and factors, $r_i$, by

(25a)    $\|\mathbf{e}^{(\nu)}\|_\infty = \max_i |e_i^{(\nu)}|$,

(25b)    $r_i \equiv \sum_{j=1,\, j \ne i}^{n} \left| \dfrac{a_{ij}}{a_{ii}} \right|$;

and let the matrix $A$ satisfy

(26)    $r \equiv \max_i r_i < 1$.

Then

(27)    $\|\mathbf{e}^{(\nu)}\|_\infty \le r^{\nu}\|\mathbf{e}^{(0)}\|_\infty$

and $\mathbf{e}^{(\nu)} \to \mathbf{o}$ as $\nu \to \infty$.

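Proof. The lemma clearly follows from the inequalities

(28)    $\|\mathbf{e}^{(\nu)}\|_\infty \le r\|\mathbf{e}^{(\nu-1)}\|_\infty$,    $\nu = 1, 2, \ldots$;

which we shall prove by induction (on the components of $\mathbf{e}^{(\nu)}$). From (24)
with $i = 1$ we obtain, using (25) and (26),

    $|e_1^{(\nu)}| \le \sum_{j=2}^{n} \left| \dfrac{a_{1j}}{a_{11}} \right| |e_j^{(\nu-1)}| \le \|\mathbf{e}^{(\nu-1)}\|_\infty \sum_{j=2}^{n} \left| \dfrac{a_{1j}}{a_{11}} \right| = \|\mathbf{e}^{(\nu-1)}\|_\infty \cdot r_1 \le r\|\mathbf{e}^{(\nu-1)}\|_\infty$.

Now assume $|e_k^{(\nu)}| \le r\|\mathbf{e}^{(\nu-1)}\|_\infty$ for $k = 1, 2, \ldots, i - 1$. Then again from
(24), recalling that $r < 1$,

    $|e_i^{(\nu)}| \le \sum_{j=1}^{i-1} \left| \dfrac{a_{ij}}{a_{ii}} \right| |e_j^{(\nu)}| + \sum_{j=i+1}^{n} \left| \dfrac{a_{ij}}{a_{ii}} \right| |e_j^{(\nu-1)}|$

    $\le \|\mathbf{e}^{(\nu-1)}\|_\infty \sum_{j=1,\, j \ne i}^{n} \left| \dfrac{a_{ij}}{a_{ii}} \right| = r_i\|\mathbf{e}^{(\nu-1)}\|_\infty$

    $\le r\|\mathbf{e}^{(\nu-1)}\|_\infty$.

Thus, the induction argument is complete and since the above inequality
is valid for all $i = 1, 2, \ldots, n$, it follows by (25) that (28) is valid. ■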

The convergence test of this lemma is easily applied and is formally 
identical to that of (18a) for the Jacobi method. However, it is not generally 
true that if the Gauss-Seidel method converges then the Jacobi method 
will converge, nor is the converse generally true. 

See Subsection 4.4 for other convergence tests. 


4.3. Method of Residual Correction 

This iterative scheme improves upon the accuracy of the approximate
solution of (1) (obtained for example by the Gaussian elimination method),
by using the approximate numerical triangular factorization of $A$. That
is, the triangularization of (1), performed with $t$ digits, produces $\mathcal{L}$
(lower), $\mathcal{U}$ (upper), and $\mathbf{x}^{(0)}$. Now define

(29)    $N \equiv \mathcal{L}\mathcal{U}$,    $P \equiv N - A$,
        $\mathbf{r}^{(0)} \equiv \mathbf{f} - A\mathbf{x}^{(0)}$.

Observe that $N$ is easily invertible, or rather that the equation

    $\mathcal{L}\mathcal{U}\mathbf{y} = \mathbf{z}$

may be readily “solved,” since $\mathcal{L}$ and $\mathcal{U}$ are triangular, by using $n(n + 1)$
operations. [If $\mathcal{L} \equiv (s_{ij})$ has $s_{ii} = 1$ for all $i$, then the number of operations
used to solve $\mathcal{L}\mathbf{w} = \mathbf{z}$ is $n(n - 1)/2$, while the number for solving $\mathcal{U}\mathbf{y} = \mathbf{w}$
is $n(n + 1)/2$. Hence, in this case $n^2$ is the operational count for solving
$N\mathbf{y} = \mathbf{z}$.]

Now, the iteration scheme given by (4) is convergent if $M$, defined by
(7), satisfies

    $\|M\| = \|I - N^{-1}A\| < 1$.

This inequality is satisfied if $\|P\| \cdot \|A^{-1}\| < \tfrac{1}{2}$ (see Problem 5). In practice,
(4) is not solved in the form

(30)    $\mathcal{L}\mathcal{U}\mathbf{x}^{(\nu)} = (\mathcal{L}\mathcal{U} - A)\mathbf{x}^{(\nu-1)} + \mathbf{f}$.

Rather, we introduce the change in the iterate by

    $\delta\mathbf{x}^{(\nu-1)} \equiv \mathbf{x}^{(\nu)} - \mathbf{x}^{(\nu-1)}$

and the residual of the iterate by

(31)    $\mathbf{r}^{(\nu-1)} \equiv \mathbf{f} - A\mathbf{x}^{(\nu-1)}$.

Then (30) can be written simply as

(32)    $\mathcal{L}\mathcal{U}[\delta\mathbf{x}^{(\nu-1)}] = \mathbf{r}^{(\nu-1)}$,

and the computations are done with these equations.
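The scheme can be tried out with a deliberately crude factorization. In the Python sketch below, which is illustrative only, the factors are rounded to `t` significant decimal digits to imitate a short word length, while residuals are formed in ordinary double precision. The helper names (`rnd`, `lu_factor`, `lu_solve`, `residual_correction`) and the rounding device are assumptions of this sketch, and no row interchanges are performed.

```python
def rnd(v, t):
    """Round v to t significant decimal digits."""
    return float(f"{v:.{t - 1}e}")


def lu_factor(a, t):
    """Unit lower and upper triangular factors, each stored entry rounded to t digits."""
    n = len(a)
    u = [row[:] for row in a]
    low = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            low[i][k] = rnd(u[i][k] / u[k][k], t)
            u[i][k] = 0.0
            for j in range(k + 1, n):
                u[i][j] = rnd(u[i][j] - low[i][k] * u[k][j], t)
    return low, u


def lu_solve(low, u, z):
    """Solve (low * u) y = z by forward and back substitution."""
    n = len(z)
    w = [0.0] * n
    for i in range(n):
        w[i] = z[i] - sum(low[i][p] * w[p] for p in range(i))
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        y[i] = (w[i] - sum(u[i][p] * y[p] for p in range(i + 1, n))) / u[i][i]
    return y


def residual_correction(a, f, t, steps):
    """x^(0) from the rounded factors, followed by `steps` corrections."""
    n = len(f)
    low, u = lu_factor(a, t)
    x = lu_solve(low, u, f)
    for _ in range(steps):
        r = [f[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        dx = lu_solve(low, u, r)        # correction from the same factors
        x = [x[i] + dx[i] for i in range(n)]
    return x
```

The factorization is computed once and reused in every step, so each correction costs only a residual evaluation and two triangular solves.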

The evaluation of $\mathbf{r}^{(\nu)}$ involves $n^2$ operations; hence each iteration step,
(31) and (32), requires $n(2n + 1)$ operations (only $2n^2$ operations if $s_{ii} = 1$
for all $i$). By using (1) and (31) the error satisfies

(33)    $\|\mathbf{x}^{(\nu)} - \mathbf{x}\| = \|A^{-1}\mathbf{r}^{(\nu)}\| \le \|A^{-1}\| \cdot \|\mathbf{r}^{(\nu)}\|$.

But from (32), the definition of $M$, and the corollary to Theorem 1.5 of
Chapter 1,

    $\|A^{-1}\mathbf{r}^{(\nu)}\| = \|A^{-1}N\,\delta\mathbf{x}^{(\nu)}\| = \|(I - M)^{-1}\delta\mathbf{x}^{(\nu)}\| \le \dfrac{\|\delta\mathbf{x}^{(\nu)}\|}{1 - q}$,

provided $q \equiv \|M\| = \|N^{-1}(N - A)\| < 1$.

As described in Subsection 1.2, the numerical solution of (32) produces
a vector $\delta\mathbf{x}^{(\nu-1)}$ that satisfies

(34)    $\mathcal{L}_{\nu-1}\mathcal{U}_{\nu-1}\,\delta\mathbf{x}^{(\nu-1)} = \mathbf{r}^{(\nu-1)}$,

where

    $\mathcal{L}_{\nu-1} \equiv \mathcal{L} + \delta\mathcal{L}_{\nu-1}$,    $\mathcal{U}_{\nu-1} \equiv \mathcal{U} + \delta\mathcal{U}_{\nu-1}$.

The perturbations $\delta\mathcal{L}_{\nu}$ and $\delta\mathcal{U}_{\nu}$ are small relative to $\mathcal{L}$ and $\mathcal{U}$ respectively
if the number of digits carried in the arithmetic calculations is large enough.
Set

    $N_{\nu} \equiv \mathcal{L}_{\nu}\mathcal{U}_{\nu}$,    $P_{\nu} \equiv N_{\nu} - A$

and

    $M_{\nu} \equiv N_{\nu}^{-1}P_{\nu}$.

Then the error

    $\mathbf{e}^{(\nu)} \equiv \mathbf{x}^{(\nu)} - \mathbf{x}$,

satisfies

    $\mathbf{e}^{(\nu)} = M_{\nu-1}\mathbf{e}^{(\nu-1)} = M_{\nu-1}M_{\nu-2}\mathbf{e}^{(\nu-2)} = \cdots = M_{\nu-1}M_{\nu-2} \cdots M_0\mathbf{e}^{(0)}$.

If $\|M_i\| \le q < 1$ for all $i$, then $\|\mathbf{e}^{(\nu)}\| \le q^{\nu}\|\mathbf{e}^{(0)}\|$, and the scheme is
convergent for any $\mathbf{e}^{(0)}$.

As a practical matter, from equations (31) and (32), we see that $\mathbf{r}^{(\nu)} \to \mathbf{o}$
may occur only if the right-hand side of (31) is calculated to ever higher
precision as $\nu$ increases. On the other hand, equation (32) or equation (34)
requires only single precision accuracy for $\mathbf{r}^{(\nu-1)}$, in order to determine
$\delta\mathbf{x}^{(\nu-1)}$ by using single precision arithmetic.

4.4. Positive Definite Systems 

Many of the large order linear systems that arise in practice have 
real symmetric matrices which are positive definite. In such cases we can 
show that a quite general class of iteration methods converges. We state 
this result as 

THEOREM 2. Let $A$ be Hermitian (of order $n$) and $N$ be any non-singular
matrix (of order $n$) for which

(35)    $Q \equiv N + N^* - A$

is positive definite. Then the matrix

    $M \equiv I - N^{-1}A$

is convergent iff $A$ is positive definite.

Proof. For any eigenvalue, $\lambda$, and corresponding eigenvector, $\mathbf{u}$, of
$M$ we have

    $M\mathbf{u} = \lambda\mathbf{u}$.

But since $N$ is non-singular this implies

(36)    $A\mathbf{u} = (1 - \lambda)N\mathbf{u}$,

and so $\lambda = 1$ iff $A\mathbf{u} = \mathbf{o}$.

Now let $A$ be positive definite (i.e., $\mathbf{v}^*A\mathbf{v} > 0$ if $\mathbf{v} \ne \mathbf{o}$) and $\mathbf{u}$ be any
eigenvector of $M$. Then $A\mathbf{u} \ne \mathbf{o}$ so that the corresponding eigenvalue $\lambda$
of $M$ satisfies $\lambda \ne 1$. By taking the complex inner product of each side of
(36) with $\mathbf{u}$ we then obtain

    $\dfrac{1}{1 - \lambda} = \dfrac{\mathbf{u}^*N\mathbf{u}}{\mathbf{u}^*A\mathbf{u}}$.

The complex conjugate of this expression is, since $\mathbf{u}^*A\mathbf{u}$ is real,

    $\dfrac{1}{1 - \bar{\lambda}} = \dfrac{\mathbf{u}^*N^*\mathbf{u}}{\mathbf{u}^*A\mathbf{u}}$.

If we add these two equations we get

    $2\,\mathrm{Re}\,\dfrac{1}{1 - \lambda} = \dfrac{\mathbf{u}^*(N + N^*)\mathbf{u}}{\mathbf{u}^*A\mathbf{u}}$.

Now set $\lambda = \alpha + i\beta$ and recall (35) to write this as

    $\dfrac{2(1 - \alpha)}{(1 - \alpha)^2 + \beta^2} = 1 + \dfrac{\mathbf{u}^*Q\mathbf{u}}{\mathbf{u}^*A\mathbf{u}} > 1$,

since by hypothesis $Q$ is positive definite. Hence, we have the inequality

    $|\lambda|^2 = \alpha^2 + \beta^2 < 1$.

The sufficiency is thus demonstrated. The necessity part of the proof is
indicated in Problems 1 and 2. ■

As an immediate corollary of this theorem, we have a result on the 
convergence of the Gauss-Seidel method for Hermitian matrices. 

corollary 1. Let A be Hermitian with positive diagonal elements. Then 
the Gauss-Seidel method for this matrix converges iff A is positive definite. 

Proof. By the hypothesis on $A$ it can be written as

    $A = D + E + E^*$

where $D$ is a diagonal matrix of positive diagonal elements and $E$ is strictly
lower triangular (i.e., zeros on and above the diagonal). The Gauss-Seidel
method applied to $A$, see (22), is equivalent to the splitting

    $N = D + E$,    $P = -E^*$.

However, with this choice for $N$ we have

    $Q = N + N^* - A = D$

which is clearly positive definite. Thus the hypothesis of Theorem 2
applies and the proof is concluded. ■

Similarly, we obtain a result on the convergence of the Jacobi iterations
as a special case of

COROLLARY 2. Let $D = D^*$ be non-singular and

    $D - (E + E^*)$

be positive definite. Then

    $D^{-1}(E + E^*)$

is convergent iff $A \equiv D + (E + E^*)$ is positive definite.

Proof. Use $N = D$ in the theorem. ■

In the special case that $D$ is a diagonal matrix, Corollary 2 yields the
convergence of the Jacobi method for the matrix $A$.
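A small numerical experiment shows the difference between the two corollaries. The matrix below is an example chosen for this sketch, not one taken from the text: it is real symmetric with unit diagonal and eigenvalues 2.6, 0.2, 0.2, hence positive definite, so the Gauss-Seidel iteration must converge; but $D - (E + E^*)$ has the negative eigenvalue $-0.6$, so the hypothesis of Corollary 2 fails and the Jacobi iteration can diverge. The function names are hypothetical.

```python
def jacobi(a, f, x0, sweeps):
    """Simultaneous iteration (new iterate built from the old one only)."""
    n = len(a)
    x = x0[:]
    for _ in range(sweeps):
        x = [(f[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
             for i in range(n)]
    return x


def gauss_seidel(a, f, x0, sweeps):
    """Successive iteration (iterate overwritten in place)."""
    n = len(a)
    x = x0[:]
    for _ in range(sweeps):
        for i in range(n):
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (f[i] - s) / a[i][i]
    return x


def max_error(x, exact):
    return max(abs(u - v) for u, v in zip(x, exact))


# Symmetric positive definite example with exact solution (1, 1, 1).
a_spd = [[1.0, 0.8, 0.8],
         [0.8, 1.0, 0.8],
         [0.8, 0.8, 1.0]]
f_spd = [2.6, 2.6, 2.6]
```

Starting from the zero vector, the Jacobi error for this matrix is multiplied by 1.6 in magnitude at every sweep, while the Gauss-Seidel iterates approach the solution, which also illustrates that convergence of one method does not imply convergence of the other.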

4.5. Block Iterations 

There are other splittings of A which in many important cases yield 
rapidly convergent iterations. In particular, since tridiagonal and block- 
tridiagonal systems are easily solved, it is natural to consider iterations in 
which N has either of these forms. Many of the large order systems which 
arise in the finite difference methods for partial differential equations 
suggest such block iterations. More generally, if the elements “close” to 
the diagonal of a matrix are large compared to the other elements, it is 
usually advantageous to include all of these large elements in N (assuming, 
of course, that the resulting systems which determine the iterates are still 
easily solved). Of course, in all applications of these block methods, 
attempts should be made to prove the convergence of the method and, 
if possible, to estimate the rate of convergence. 


PROBLEMS, SECTION 4 

1. Let the sequence $\{\mathbf{v}_\nu\}$ be defined, with $\mathbf{v}_0$ arbitrary, by 

$\mathbf{v}_{\nu+1} = M \mathbf{v}_\nu, \quad \nu = 0, 1, \ldots,$

where $M = I - N^{-1}A$ and A is Hermitian. Then 

(a) Verify the identity 

$\mathbf{v}_\nu^* A \mathbf{v}_\nu - \mathbf{v}_{\nu+1}^* A \mathbf{v}_{\nu+1} = (\mathbf{v}_\nu - \mathbf{v}_{\nu+1})^* Q (\mathbf{v}_\nu - \mathbf{v}_{\nu+1})$

where $Q = N + N^* - A$; 

(b) If Q is positive definite show that $\{\mathbf{v}_\nu^* A \mathbf{v}_\nu\}$ is a non-increasing sequence. 
(In fact, the sequence is strictly decreasing if 1 is not an eigenvalue of M.) 

2. Use part (b) of Problem 1 to show that if M is convergent then A is 
positive definite. 
[Hint: Use proof by contradiction; assume $\mathbf{v}_0^* A \mathbf{v}_0 \le 0$ for some $\mathbf{v}_0 \ne \mathbf{o}$. 
Then, since M is convergent, $\mathbf{v}_1 \ne \mathbf{v}_0$. Therefore, 

$\mathbf{v}_\nu^* A \mathbf{v}_\nu \le \mathbf{v}_1^* A \mathbf{v}_1 < \mathbf{v}_0^* A \mathbf{v}_0 \le 0.$

This is a contradiction, since the convergence of M implies $\mathbf{v}_\nu \to \mathbf{o}$.] 

3. Analyze the convergence of the Jacobi and the Gauss-Seidel iterative 
methods for the second order matrix 


-CO- 


IH < 1, Xo * o. 


4. Determine when the Jacobi iterative method converges for the 
compound matrix 


A = 

of order n . 

[Hint: Work with the compound vectors 
vectors 


with / and S 


(', >■ 

0 

C) - 0 - 0- 


Define the compound error 


Find a recursion formula for {e v } that doesn’t involve {g v }.] 

5. The convergence of the residual correction scheme defined by (30) is 
assured if $\|I - N^{-1}A\| < 1$. Verify that this inequality holds if 

$\|N - A\| \cdot \|A^{-1}\| < \tfrac{1}{2}.$

[Hint: Let $B = A^{-1}N$. Then 

$I - N^{-1}A = I - B^{-1} = B^{-1}(B - I) = B^{-1}(A^{-1}P).$

Note that $B = A^{-1}P + I$ and therefore, by the remark following the corollary 
to Theorem 1.5 of Chapter 1, we have $\|B^{-1}\| < 2$.] 

5. THE ACCELERATION OF ITERATIVE METHODS 

Given any iteration procedure, for a specific system of equations, it 
may be possible to improve its rate of convergence by a simple device. 
Such modifications, which we call acceleration, are frequently termed 
“extrapolation,” “over-relaxation,” or various other names depending upon 
the problem to which they are applied or perhaps upon the particular 
form of device which is used. In any event, the general principle common 
to almost all acceleration procedures is the introduction of a splitting, 
similar to (4.2), which depends upon some real parameter, say $\alpha$, in an 
“appropriate” manner. The splitting may be denoted by 

$A = N(\alpha) - P(\alpha)$

and is still subject to the requirement that 

$\det |N(\alpha)| \ne 0.$

(This will place some restriction on the permissible values of $\alpha$.) Now, 
as has been shown in Section 4, an iteration scheme based on the above 
splitting will converge, for arbitrary initial vectors, iff all eigenvalues of 

$M(\alpha) \equiv N^{-1}(\alpha) P(\alpha),$

are in absolute value less than unity. 

Let these eigenvalues be denoted by 

$\lambda_i(\alpha), \quad i = 1, 2, \ldots, n;$

where, as indicated, their values may depend upon the choice of the 
parameter $\alpha$. Now if a value of $\alpha$ can be determined such that 

$\rho[M(\alpha)] \equiv \max_i |\lambda_i(\alpha)| < 1,$

the scheme will converge. Furthermore, since the rate of convergence 
increases as $\rho[M(\alpha)]$ decreases, 
the convergence is “best” for the value $\alpha = \alpha^*$ such that 

$\rho[M(\alpha^*)] = \min_\alpha \rho[M(\alpha)].$

The selection of an optimal $\alpha^*$ is the most important feature of acceleration 
procedures. 

Some acceleration procedures that are commonly used are described 
as follows: Let some definite splitting, (4.2), be given by 

(1) $A = N_0 - P_0,$

where $N_0$ and $P_0$ are fixed matrices with $\det |N_0| \ne 0$. Let the relevant 
eigenvalues of this scheme, i.e., those of 

(2a) $M_0 \equiv N_0^{-1} P_0,$

be 

(2b) $\lambda_i, \quad i = 1, 2, \ldots, n.$

Then we introduce the one-parameter family of splittings 

(3) $N(\alpha) = (1 + \alpha) N_0, \qquad P(\alpha) = (1 + \alpha) N_0 - A = P_0 + \alpha N_0.$

In order that $\det |N(\alpha)| \ne 0$, we need only require $\alpha \ne -1$. Then if the 
eigenvalues of $M(\alpha) \equiv N^{-1}(\alpha) P(\alpha)$ are denoted by $\mu_i(\alpha)$, $i = 1, 2, \ldots, n$, 
we claim that 

(4) $\mu_i(\alpha) = \frac{\lambda_i + \alpha}{1 + \alpha}, \quad i = 1, 2, \ldots, n.$

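The verification of (4) requires only a simple application of the 
definitions of eigenvalue and eigenvector. Specifically from (2) and (3) we have 

$M(\alpha) = N^{-1}(\alpha) P(\alpha) = \frac{1}{1 + \alpha} N_0^{-1} (P_0 + \alpha N_0) = \frac{\alpha}{1 + \alpha} I + \frac{1}{1 + \alpha} M_0.$

Thus, if $\mathbf{u}$ is any eigenvector of $M_0$ belonging to the eigenvalue $\lambda$, that is 
$M_0 \mathbf{u} = \lambda \mathbf{u}$, we obtain from the above 

$M(\alpha) \mathbf{u} = \frac{\alpha}{1 + \alpha} \mathbf{u} + \frac{1}{1 + \alpha} \lambda \mathbf{u} = \frac{\lambda + \alpha}{1 + \alpha} \mathbf{u}.$

That is, $\mathbf{u}$ must also be an eigenvector of $M(\alpha)$ belonging to the eigenvalue 
$(\lambda + \alpha)/(1 + \alpha)$. Conversely, if $M(\alpha) \mathbf{v} = \mu \mathbf{v}$ we obtain 

$\mu \mathbf{v} = M(\alpha) \mathbf{v} = \frac{\alpha}{1 + \alpha} \mathbf{v} + \frac{1}{1 + \alpha} M_0 \mathbf{v},$

or, since $1 + \alpha \ne 0$ by assumption, 

$M_0 \mathbf{v} = [\mu (1 + \alpha) - \alpha] \mathbf{v}.$

Thus every eigenvector of $M(\alpha)$ is an eigenvector of $M_0$ and (4) is established 
for all $\alpha \ne -1$. 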
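Relation (4) is easy to check numerically. The sketch below is an illustration only: the helper names `eig2_sym` and `accelerated_matrix`, the 2-by-2 symmetric matrix standing in for M_0, and the value of the acceleration parameter (called `alpha` in the code) are all assumptions made for the example. It forms the accelerated matrix from its expression in terms of I and M_0 and compares its eigenvalues with the prediction of (4).

```python
import math

def eig2_sym(M):
    """Eigenvalues (ascending) of a real symmetric 2x2 matrix."""
    a, b, d = M[0][0], M[0][1], M[1][1]
    mean = 0.5 * (a + d)
    rad = math.hypot(0.5 * (a - d), b)
    return [mean - rad, mean + rad]

def accelerated_matrix(M0, alpha):
    """M(alpha) = alpha/(1+alpha) I + 1/(1+alpha) M0, valid for alpha != -1."""
    c = 1.0 / (1.0 + alpha)
    n = len(M0)
    return [[c * M0[i][j] + (alpha * c if i == j else 0.0) for j in range(n)]
            for i in range(n)]

M0 = [[0.2, 0.5], [0.5, -0.4]]          # illustrative symmetric M0
alpha = 0.3                             # illustrative parameter, 1 + alpha > 0
lams = eig2_sym(M0)
mus = eig2_sym(accelerated_matrix(M0, alpha))
predicted = [(lam + alpha) / (1.0 + alpha) for lam in lams]
```

Because 1 + alpha is positive here, the map preserves the ordering of the eigenvalues, so `mus` and `predicted` can be compared entry by entry.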

In order to determine convergent schemes of the form (3), we must 
study the relation (4). This is done first for a very important class of 
special cases in which the “best” such scheme can be obtained. These 
results may be stated as 

THEOREM 1. Let $N_0$ and $P_0$ be such that the eigenvalues $\lambda_i$ of $N_0^{-1} P_0$ are 
all real and satisfy 

(5) $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n < 1.$

Then the scheme (3) will converge for any $\alpha$ such that 

(6) $\alpha > -\frac{1 + \lambda_1}{2} > -1.$

Furthermore, the largest rate of convergence for these schemes is obtained 
when 

(7) $\alpha = \alpha^* \equiv -\frac{\lambda_1 + \lambda_n}{2}$

for which value 

(8) $\rho[M(\alpha^*)] = \min_\alpha \rho[M(\alpha)] = \min_\alpha \left( \max_{i = 1, \ldots, n} |\mu_i(\alpha)| \right) = \frac{\lambda_n - \lambda_1}{2 - (\lambda_1 + \lambda_n)} < 1.$


Proof. A scheme of the form (3) will converge if $|\mu_i(\alpha)| < 1$, $i = 1, 2, \ldots, n$. 
Let us introduce 

(9) $x \equiv \frac{1}{1 + \alpha}, \qquad m_j \equiv \lambda_j - 1, \quad j = 1, 2, \ldots, n.$

Figure 1 

Then (4) can be written as 

(10) $\mu_i = m_i x + 1, \quad i = 1, 2, \ldots, n,$

where by (5), all $m_i < 0$. The equations (10), for the $\mu_i$ as functions of x, 
represent n straight lines with negative slopes. Let us assume that the 
$\lambda_i$ have been ordered as in (5). Then by (9) we have 

$m_1 \le m_2 \le \cdots \le m_n < 0,$

and all the lines (10) are bounded by those for i = 1 and i = n (see Figure 
1). Thus, we have for 

(11) $x > 0: \quad \mu_1 = m_1 x + 1 \le \mu_i \le m_n x + 1 = \mu_n;$
     $x < 0: \quad \mu_n = m_n x + 1 \le \mu_i \le m_1 x + 1 = \mu_1; \qquad i = 1, 2, \ldots, n.$

Clearly then, all $\mu_i < 1$ iff $x > 0$. Similarly, all $\mu_i > -1$ iff $\mu_1 > -1$ or 
equivalently $x < -2/m_1$. Thus, $|\mu_i| < 1$ iff $0 < x < -2/m_1$, and using 
(9) we obtain (6). 

For $x > 0$ we have by (11) 

$\mu \equiv \max_i |\mu_i| = \max(|m_1 x + 1|, |m_n x + 1|).$

From Figure 1 it is then clear that 

$\min_{x > 0} \mu = |m_1 x^* + 1| = |m_n x^* + 1| = 1 - \frac{2 m_n}{m_1 + m_n}$

where $x^* = -2/(m_1 + m_n)$. Upon applying (9) again, we obtain (7) and 
(8) and the proof is complete. ■ 
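The formulas (7) and (8) can be compared with a brute-force search. This is a sketch under stated assumptions: the function names `optimal_alpha` and `rho`, the sample eigenvalues, and the search grid are all hypothetical and chosen only so that hypothesis (5) holds.

```python
def optimal_alpha(lam_min, lam_max):
    """Return (alpha*, rho*) from (7) and (8); requires lam_max < 1."""
    alpha_star = -(lam_min + lam_max) / 2.0
    rho_star = (lam_max - lam_min) / (2.0 - (lam_min + lam_max))
    return alpha_star, rho_star

def rho(alpha, lams):
    """Spectral radius max_i |(lam_i + alpha)/(1 + alpha)| obtained from (4)."""
    return max(abs((lam + alpha) / (1.0 + alpha)) for lam in lams)

lams = [-0.9, -0.2, 0.3, 0.5]            # illustrative real eigenvalues, all < 1
alpha_star, rho_star = optimal_alpha(lams[0], lams[-1])

# Scan admissible values alpha > -(1 + lam_min)/2 = -0.05 for comparison.
grid = [-0.04 + 0.001 * k for k in range(2000)]
best = min(grid, key=lambda a: rho(a, lams))
```

For these sample values the basic scheme (alpha = 0) has spectral radius 0.9, while the optimal parameter reduces it to 1.4/2.4, and the grid search lands on the same parameter to within the grid spacing.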





By an exactly analogous proof similar results can be obtained for the 
case where all $\lambda_i > 1$ (see Problem 1). 

In the general case, the $\lambda_j$ and hence also the $\mu_j(\alpha)$ will be complex. 
Then the schemes (3) will be convergent if the complex numbers 
$\mu_j(\alpha) \equiv \xi_j(\alpha) + i \eta_j(\alpha)$ all lie in the interior of the unit circle 

(12) $|\mu|^2 = \xi^2 + \eta^2 = 1,$

of the $(\xi, \eta)$-plane. The relations (4) can now be considered as special 
points of the mapping of the $\lambda = x + iy$ plane into the $\mu$-plane 

(13) $\mu = \frac{\lambda + \alpha}{1 + \alpha}, \qquad \alpha \ne -1.$

This is a special case of the well-known Möbius transformations studied 
in function theory. If $\alpha$ is real, we can easily verify directly that the unit 
circle (12) in the $\mu$-plane corresponds to a circle in the $\lambda$-plane given by 

(14) $(x + \alpha)^2 + y^2 = (1 + \alpha)^2.$

It can also be shown that any interior point of the circle (14) is mapped by 
(13) into an interior point of (12). The transformation (13) is illustrated in 
Figure 2 for $\alpha > -1$ and $\alpha < -1$. 

From this figure it can be seen that convergent schemes can be found 
if the eigenvalues $\lambda_i$ satisfy either 

(15a) $\mathrm{Re}(\lambda_i) < 1, \quad i = 1, 2, \ldots, n,$

or 

(15b) $\mathrm{Re}(\lambda_i) > 1, \quad i = 1, 2, \ldots, n.$

That is, a “convergent” value of $\alpha$ can be obtained corresponding to any 
circle in the $\lambda$-plane which has the properties: 

(i) the center is on the real axis; 

(ii) it passes through the point (1, 0); 

(iii) all eigenvalues $\lambda_i$ are interior to it. 

If such a circle exists, then we call the coordinates of its center $(-\alpha, 0)$ 
and this value of $\alpha$ yields a convergent scheme. However, now it is not a 
simple matter to determine the best value of $\alpha$. 
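The circle condition can be turned into a small computation. The sketch below rests on a short derivation that is not in the text: for Re(lam) < 1, the point lam lies inside the circle (14) exactly when the parameter exceeds (|lam|^2 - 1) / (2 (1 - Re lam)), so any parameter above the largest such value gives a convergent scheme under (15a). The function names and the sample eigenvalues are hypothetical.

```python
def alpha_threshold(lams):
    """Bound above which every lam lies inside the circle (14).

    Assumes Re(lam) < 1 for every eigenvalue, i.e. condition (15a).
    """
    return max((abs(l) ** 2 - 1.0) / (2.0 * (1.0 - l.real)) for l in lams)

def mu(lam, alpha):
    """The mapping (13)."""
    return (lam + alpha) / (1.0 + alpha)

# Illustrative complex eigenvalues; the basic scheme diverges since |lam| > 1.
lams = [0.5 + 2.0j, 0.5 - 2.0j, -1.5 + 0.0j]
a0 = alpha_threshold(lams)
radii_above = [abs(mu(l, a0 + 0.5)) for l in lams]   # parameter above the bound
radii_below = [abs(mu(l, a0 - 0.5)) for l in lams]   # parameter below the bound
```

This only locates convergent parameters; as the text notes, finding the best one in the complex case is a harder problem.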

5.1. Practical Application of Acceleration Methods 

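It is assumed that the basic scheme determined by (1) can be efficiently 
computed. That is, to solve 

$A \mathbf{x} = \mathbf{f}$

we consider the iterates, with $\mathbf{x}^{(0)}$ arbitrary, given by 

(16) $N_0 \mathbf{x}^{(\nu)} = P_0 \mathbf{x}^{(\nu-1)} + \mathbf{f}, \quad \nu = 1, 2, \ldots.$

We assume that such systems can be solved in an efficient manner. Now 
the iterates, $\mathbf{x}^{(\nu)}$, satisfy the system of equations 

(17) $\mathbf{x}^{(\nu)} = M_0 \mathbf{x}^{(\nu-1)} + \mathbf{g}, \quad \nu = 1, 2, \ldots,$

where $\mathbf{g} \equiv N_0^{-1} \mathbf{f}$ and $M_0$ is defined in (2a). 

The acceleration (3) corresponding to this procedure yields with $\mathbf{y}^{(0)}$ 
arbitrary 

(18) $\mathbf{y}^{(\nu)} = M(\alpha) \mathbf{y}^{(\nu-1)} + \frac{1}{1 + \alpha} \mathbf{g}, \quad \nu = 1, 2, \ldots,$

since $N^{-1}(\alpha) \mathbf{f} = \frac{1}{1 + \alpha} N_0^{-1} \mathbf{f}$. In terms of $M_0$ these iterates can be 
written as 

(19) $\mathbf{y}^{(\nu)} = \frac{\alpha}{1 + \alpha} \mathbf{y}^{(\nu-1)} + \frac{1}{1 + \alpha} (M_0 \mathbf{y}^{(\nu-1)} + \mathbf{g}), \quad \nu = 1, 2, \ldots.$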
A comparison of (17) and (19) yields some insight into the relationship 
between the basic scheme and its accelerated version: 

If $\alpha = 0$ the basic scheme results. If $\alpha > 0$ then for any real numbers 
a and b, 

$\min(a, b) \le \frac{\alpha}{1 + \alpha} a + \frac{1}{1 + \alpha} b \le \max(a, b).$

So in this case, the acceleration scheme yields a vector on the line segment 
joining the previous iterate and what would have been the next iterate of 
the basic scheme. The term “interpolated iteration” is frequently employed 
to describe this type of acceleration. If $-1 < \alpha < 0$ then 

$\frac{\alpha}{1 + \alpha} a + \frac{1}{1 + \alpha} b \; \begin{cases} < b & \text{if } b < a, \\ > b & \text{if } b > a; \end{cases}$

and the acceleration scheme yields vectors with components whose values 
are definitely not between those of $\mathbf{y}^{(\nu-1)}$ and $M_0 \mathbf{y}^{(\nu-1)} + \mathbf{g}$. The scheme is 
now termed an “extrapolated iteration.” Similarly, the remaining case 
$\alpha < -1$ is such an extrapolation method. 

To compute using the scheme (19), we define the vectors $\mathbf{z}^{(\nu)}$ by 

(20a) $N_0 \mathbf{z}^{(\nu)} = P_0 \mathbf{y}^{(\nu-1)} + \mathbf{f}, \quad \nu = 1, 2, \ldots,$

and then write (18) as 

(20b) $\mathbf{y}^{(\nu)} = \frac{\alpha}{1 + \alpha} \mathbf{y}^{(\nu-1)} + \frac{1}{1 + \alpha} \mathbf{z}^{(\nu)}.$

Thus, as in the basic scheme, the calculations only require the solution of 
systems of the form (20a). [Note that in (20), $\mathbf{z}^{(\nu)}$ is defined by a recursion 
which is similar, but not identical, to (16).] 
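The two-step form (20a)-(20b) translates directly into code. The following is a minimal sketch, assuming the Jacobi splitting (N_0 equal to the diagonal of A) as the basic scheme; that choice, the function name `accelerated_jacobi`, and the test matrix are illustrative assumptions and are not prescribed by the text. The test matrix is chosen so that the eigenvalues of M_0 are -0.9, 0.45, 0.45, which by (7) gives an optimal parameter of 0.225.

```python
def accelerated_jacobi(A, f, alpha, y0, steps):
    """Iterate (20a)-(20b) with the Jacobi splitting N0 = diag(A), P0 = N0 - A."""
    n = len(A)
    y = list(y0)
    w_old, w_new = alpha / (1.0 + alpha), 1.0 / (1.0 + alpha)
    for _ in range(steps):
        # (20a): solve N0 z = P0 y + f; trivial here because N0 is diagonal.
        z = [(f[i] - sum(A[i][j] * y[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
        # (20b): weighted combination of the previous iterate and z.
        y = [w_old * y[i] + w_new * z[i] for i in range(n)]
    return y

c = 0.45
A = [[1.0, c, c], [c, 1.0, c], [c, c, 1.0]]   # Jacobi M0 has eigenvalues -2c, c, c
f = [1.0 + 2.0 * c] * 3                       # exact solution is (1, 1, 1)
y_basic = accelerated_jacobi(A, f, 0.0, [0.0] * 3, 30)     # alpha = 0: basic scheme
y_accel = accelerated_jacobi(A, f, 0.225, [0.0] * 3, 30)   # alpha* from (7)
err_basic = max(abs(v - 1.0) for v in y_basic)
err_accel = max(abs(v - 1.0) for v in y_accel)
```

After the same 30 steps the accelerated iterate is several orders of magnitude closer to the solution, in line with the reduction of the spectral radius from 0.9 to about 0.55.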

In general, the eigenvalues $\lambda_i$, or in particular $\lambda_1$ and $\lambda_n$, of the basic 
scheme will not be known. But it may be possible to approximate the 
value $\alpha = \alpha^*$ which yields the fastest convergence. This is accomplished 
by some test calculations that are easily performed: 

Since the rate of convergence is independent of the inhomogeneous 
term, $\mathbf{f}$, we seek the best scheme for solving 

(21) $A \mathbf{y} = \mathbf{o}.$

Since it is assumed that $\det |A| \ne 0$, this system has the unique solution 
$\mathbf{y} = \mathbf{o}$. Apply the scheme (18) with some fixed value of $\alpha = \alpha_1$ to (21) and 
compute [actually use (20)] 

(22) $\mathbf{y}^{(\nu)}(\alpha_1) = M^\nu(\alpha_1) \mathbf{y}^{(0)}, \quad \nu = 1, 2, \ldots,$

where $\mathbf{y}^{(0)}$ is an arbitrary but fixed initial vector. If the value $\alpha_1$ yields a 
convergent iteration, compute until 

$\|\mathbf{y}^{(\nu)}(\alpha_1)\| \equiv \max_j |y_j^{(\nu)}(\alpha_1)| \le 10^{-m},$

where m is a fixed positive number. This requires some minimum number 
of iterations which we may call $\nu(\alpha_1)$. Repeat this procedure with the same 
$\mathbf{y}^{(0)}$ and m, for a sequence of values of $\alpha = \alpha_2, \alpha_3, \ldots$, to obtain the 
corresponding sequence $\{\nu(\alpha_i)\}$. The approximate value for $\alpha^*$ can be obtained 
by plotting the points $(\nu(\alpha_i), \alpha_i)$ and choosing that $\alpha$ which seems to 
minimize the “function” $\nu(\alpha)$. 

An obvious alternative is to compute the sequence $\{\|\mathbf{y}^{(N)}(\alpha_i)\|\}$ using a 
fixed number, say N, of iterations. Then an approximation to $\alpha^*$ is that 
value which minimizes $\|\mathbf{y}^{(N)}(\alpha)\|$. 
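The test calculation described above can be sketched as a parameter sweep. Everything specific here is an assumption made for illustration: the Jacobi splitting as the basic scheme, the function name `iterations_needed`, the 3-by-3 matrix, the starting vector, the tolerance exponent m = 6, and the list of trial parameters.

```python
def iterations_needed(A, alpha, m, y0, max_steps=10000):
    """Count steps of (20) applied to A y = o until max_j |y_j| <= 10**(-m)."""
    n = len(A)
    y = list(y0)
    tol = 10.0 ** (-m)
    for step in range(1, max_steps + 1):
        # (20a) with f = o and the Jacobi splitting N0 = diag(A).
        z = [-sum(A[i][j] * y[j] for j in range(n) if j != i) / A[i][i]
             for i in range(n)]
        # (20b)
        y = [(alpha * y[i] + z[i]) / (1.0 + alpha) for i in range(n)]
        if max(abs(v) for v in y) <= tol:
            return step
    return None   # no convergence within max_steps

c = 0.45
A = [[1.0, c, c], [c, 1.0, c], [c, c, 1.0]]
y0 = [1.0, 0.5, -0.25]                       # arbitrary but fixed start vector
trial_alphas = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
counts = {a: iterations_needed(A, a, 6, y0) for a in trial_alphas}
best_alpha = min(counts, key=counts.get)
```

For this matrix the eigenvalues of M_0 are known (-0.9 and 0.45), so (7) gives 0.225 as the optimum, and the sweep should pick a trial value in that neighbourhood; in practice one would refine the grid around the apparent minimum.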

5.2. Generalizations of the Acceleration Method 

There are numerous generalizations of the acceleration method which in 
fact are more powerful than the scheme described by (3). The simplest type 
of generalization proceeds from a single basic splitting of the form (1)-(2) 
but employs cyclically a fixed sequence of acceleration parameters, say 
$\alpha_1, \alpha_2, \ldots, \alpha_r$. Specifically for $i = 1, 2, \ldots, r$, define $N(\alpha_i)$ and $P(\alpha_i)$ as in (3) 
and the corresponding matrices $M(\alpha_i)$ by 

(23) $M(\alpha_i) \equiv N^{-1}(\alpha_i) P(\alpha_i) = \frac{\alpha_i}{1 + \alpha_i} I + \frac{1}{1 + \alpha_i} M_0, \quad i = 1, 2, \ldots, r.$

The iterations are defined as follows, with $\mathbf{x}^{(0)}$ arbitrary, for $\nu = 1, 2, \ldots$: 

(24a) $\mathbf{y}^{(\nu, 0)} = \mathbf{x}^{(\nu-1)};$

(24b) $\mathbf{y}^{(\nu, s)} = M(\alpha_s) \mathbf{y}^{(\nu, s-1)} + N^{-1}(\alpha_s) \mathbf{f}, \quad s = 1, 2, \ldots, r;$

(24c) $\mathbf{x}^{(\nu)} = \mathbf{y}^{(\nu, r)}.$

Again, each of the r vectors of (24b) can be obtained by solving a linear 
system of the form 

$N(\alpha_s) \mathbf{y}^{(\nu, s)} = P(\alpha_s) \mathbf{y}^{(\nu, s-1)} + \mathbf{f}.$

With this notation, one iteration of this generalized acceleration 
scheme requires the same number of computations as r iterations in the 
ordinary acceleration scheme. The convergence of this method can be 
analyzed by means of the equivalent formulation 

(25) $\mathbf{x}^{(\nu)} = M(\alpha_1, \alpha_2, \ldots, \alpha_r) \mathbf{x}^{(\nu-1)} + \mathbf{g}, \quad \nu = 1, 2, \ldots,$

where by (24) we find that 

(26) $M(\alpha_1, \alpha_2, \ldots, \alpha_r) \equiv M(\alpha_r) \cdots M(\alpha_2) M(\alpha_1);$
     $\mathbf{g} \equiv [N^{-1}(\alpha_r) + M(\alpha_r) N^{-1}(\alpha_{r-1}) + \cdots + M(\alpha_r) \cdots M(\alpha_2) N^{-1}(\alpha_1)] \mathbf{f}.$

As in the proof of (4), the eigenvalues of $M(\alpha_1, \alpha_2, \ldots, \alpha_r)$ can be 
determined, by using (23), in terms of the eigenvalues $\lambda_i$ of $M_0$. We find in 
fact, that 

(27) $\mu_i(\alpha_1, \alpha_2, \ldots, \alpha_r) = \prod_{j=1}^{r} \frac{\lambda_i + \alpha_j}{1 + \alpha_j}, \quad i = 1, 2, \ldots, n,$

are the relevant eigenvalues. Now if we define the rth degree polynomial 

(28) $P_r(\lambda) \equiv \prod_{j=1}^{r} \frac{\lambda + \alpha_j}{1 + \alpha_j},$

then convergence is implied by $|P_r(\lambda_i)| < 1$ for $i = 1, 2, \ldots, n$. In 
particular, if all the eigenvalues of $M_0$ are real and lie in the interval 

$a \le \lambda \le b,$

then convergence is implied by 

$|P_r(\lambda)| < 1, \quad a \le \lambda \le b.$

In this case, the fastest convergence can be expected for that polynomial 
which has the smallest absolute magnitude in the indicated interval. 
Such problems are considered in Chapter 5, Section 4, and it is found that 
the Chebyshev polynomials can be used to find the polynomials of “least 
deviation from zero.” Hence, in principle, if a and b are known, the best 
acceleration parameters $\alpha_1, \alpha_2, \ldots, \alpha_r$ can be determined (see Problem 2). 
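The Chebyshev choice of parameters can be sketched numerically. The code below is an illustration under stated assumptions, not the text's own construction in detail: it places the zeros of the polynomial (28) at the zeros of the degree-r Chebyshev polynomial mapped to the interval [a, b], and compares the resulting maximum of |P_r| on the interval with r steps of the single optimal parameter from (7). The function names, the interval, and the grid are hypothetical.

```python
import math

def chebyshev_alphas(a, b, r):
    """Parameters alpha_j with -alpha_j at the zeros of T_r mapped to [a, b]."""
    mid, half = 0.5 * (a + b), 0.5 * (b - a)
    return [-(mid + half * math.cos((2 * j - 1) * math.pi / (2 * r)))
            for j in range(1, r + 1)]

def P(lam, alphas):
    """The polynomial (28); P(1) = 1 for any admissible parameters."""
    value = 1.0
    for al in alphas:
        value *= (lam + al) / (1.0 + al)
    return value

a, b, r = -0.9, 0.5, 3                        # illustrative interval with b < 1
alphas = chebyshev_alphas(a, b, r)
grid = [a + (b - a) * k / 1000 for k in range(1001)]
cheb_max = max(abs(P(lam, alphas)) for lam in grid)

single = [-(a + b) / 2.0] * r                 # r steps with alpha* from (7)
single_max = max(abs(P(lam, single)) for lam in grid)
```

With these parameters P_r is a scaled Chebyshev polynomial, so its maximum on [a, b] equals 1/T_r(sigma) with sigma = (2 - a - b)/(b - a); for the sample interval this is markedly smaller than the value obtained with the single parameter.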

Another type of generalization of the acceleration method is obtained 
by employing a sequence of different basic splittings, say 

$A = N_i - P_i, \quad i = 1, 2, \ldots, r;$

and their corresponding accelerated forms 

$N_i(\alpha_i), \quad P_i(\alpha_i).$

An application of this technique is contained in subsection 2.2 of 
Chapter 9. 


PROBLEMS, SECTION 5 

1. State and prove a theorem analogous to Theorem 1, for the case that 
(5) is replaced by 

$1 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n.$

2. If (5) is replaced by 

$-1 \le \lambda_i \le b < 1,$

compare the efficiency of 

(a) the method using three acceleration parameters $\{\alpha_i\}$ such that $\{-\alpha_i\}$ are 
the zeros of the Chebyshev polynomial of third degree for the interval [-1, b], 
with 

(b) the method using the single parameter $\alpha = (1 - b)/2$. 

6. MATRIX INVERSION BY HIGHER ORDER ITERATIONS 

The previous methods of this chapter have been primarily concerned 
with solving linear systems. Of course, as demonstrated in Section 0, 
they can all be employed to determine the inverse of any given 
non-singular matrix, A. We consider now an iterative method for directly 
computing $A^{-1}$. This method is a means for improving the accuracy of an 
approximate inverse, say $R_0$, obtained by other procedures. However, 
in many cases the present method is feasible when the initial approximate 
inverse is assumed to have the simple form $R_0 = \omega I$. Because of the large 
number of operations involved in matrix multiplication, these schemes are 
not generally used. 

Assume that $R_0$ is any approximation to $A^{-1}$ and define the error in this 
approximation by 

(1) $E_0 = I - A R_0.$

Clearly, if $R_0 = A^{-1}$ then $E_0 = O$. Now with $R_0$ as the initial 
approximation, we define a sequence of approximate inverses by 

(2a) $R_\nu = R_{\nu-1} (I + E_{\nu-1} + E_{\nu-1}^2 + \cdots + E_{\nu-1}^{p-1}), \quad \nu = 1, 2, \ldots,$

(2b) $E_\nu = I - A R_\nu, \quad \nu = 1, 2, \ldots.$

Here, p is an arbitrary fixed integer not less than two. (This method is 
usually described for the case p = 2 but, as will be shown, the “best” 
value for this integer is p = 3.) From (2) we obtain 

(3) $E_\nu = I - A R_\nu$
    $= I - A R_{\nu-1} (I + E_{\nu-1} + \cdots + E_{\nu-1}^{p-1})$
    $= I - (I - E_{\nu-1}) (I + E_{\nu-1} + \cdots + E_{\nu-1}^{p-1})$
    $= E_{\nu-1}^{p}, \quad \nu = 1, 2, \ldots.$

Thus, in one iteration, the error matrix is raised to the pth power and the 
method is consequently called a pth order method. Apply (3) recursively, 
to find that 

(4) $E_\nu = E_0^{p^\nu}, \quad \nu = 1, 2, \ldots.$
Also from (2b) and the above we have 

(5) $A^{-1} - R_\nu = A^{-1} E_\nu = A^{-1} E_0^{p^\nu} = R_0 (I - E_0)^{-1} E_0^{p^\nu};$

where we have used (1) and the assumption that $\det |I - E_0| \ne 0$. It is 
now clear that the iterations converge when $E_0$ is a convergent matrix 
(see Section 4). 

Let us assume that $E_0$ is a convergent matrix. Then its eigenvalues 
$\lambda_i(E_0)$ satisfy $|\lambda_i| < 1$, $i = 1, 2, \ldots, n$, from Theorem 4.1. Let 

$\rho = \rho(E_0) = \max_i |\lambda_i|.$

Then, since the eigenvalues of $E_0^p$ are $\lambda_i^p$, the error $E_\nu$ of (4) vanishes 
like $\rho^{p^\nu}$.† 

We now pose the problem of determining the “best” value of p to be 
used for any convergent $E_0$. By “best” we shall mean that procedure 
which for the least amount of computation yields an approximate inverse 
of desired accuracy. Alternatively, the best scheme could be defined as 
the one for which a given amount of computation yields the most accurate 
inverse. Adopting our usual convention we find from (2), since the product 
of two matrices requires $n^3$ operations, that $\nu$ iterations of a pth order 
scheme require 

$\nu p n^3$ ops. 

If only K operations are to be permitted the number of iterations allowed 
is 

$\nu = \frac{K}{p n^3},$

where we assume $K/(p n^3)$ is an integer. Thus, the principal eigenvalue is 
reduced to 

$\rho^{p^\nu} = \rho^{\,p^{K/(p n^3)}} = \rho^{\,(p^{1/p})^{K/n^3}}.$

Since K, n, and $\rho < 1$ are independent of p, we find that the error is 
minimized when $p^{1/p}$ is a maximum. Now it is easily shown that the 
maximum of $x^{1/x}$ is at $x = e = 2.718\ldots$. But a simple calculation (pointed 
out by M. Altman) shows that for integers p the maximum is at p = 3. 
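The "simple calculation" for integer p takes a few lines to reproduce. This is only an illustrative check; the variable names are arbitrary.

```python
# Evaluate p**(1/p) for small integer orders p and pick the maximizer.
values = {p: p ** (1.0 / p) for p in range(2, 8)}
best_p = max(values, key=values.get)
```

The values for p = 2 and p = 4 coincide (both equal the square root of 2), and p = 3 gives the largest value, so a third order scheme makes the best use of a fixed operation budget under the operation count used above.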

† This is only rigorously true if the elementary divisors of $E_0$ are simple. But, by the 
corollary to Theorem 1.3 of Chapter 1, the statement isn’t very wrong. 

In order to apply the procedure (2), we must have an initial estimate $R_0$ 
such that $E_0 = I - A R_0$ is convergent. For a very important class of 
matrices such an estimate is easily found. This result is contained in 

THEOREM 1. Let A have real eigenvalues in the interval 

$0 < m \le \lambda_j \le M, \quad j = 1, 2, \ldots, n.$

Then if $R_0(\omega) = \omega I$ and $E_0(\omega) = I - \omega A$, $E_0(\omega)$ will be convergent for 
all $\omega$ in 

(6) $0 < \omega < \frac{2}{M}.$

Further if $\rho(\omega)$ is the spectral radius of $E_0(\omega)$, i.e., $\rho(\omega) \equiv \rho(E_0(\omega))$, then 

(7) $\rho(\omega^*) = \min_{0 < \omega < 2/M} \rho(\omega) = \frac{M - m}{M + m}, \qquad \omega^* = \frac{2}{M + m}.$

Proof. This theorem is essentially a restatement of Theorem 5.1. 
If we make the association $(x, m_j) \leftrightarrow (\omega, -\lambda_j)$ in the proof of that theorem, 
the above follows. ■ 
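The iteration (2a)-(2b), started from the scalar initial guess of the theorem, can be sketched as follows. The helper names, the 2-by-2 test matrix (eigenvalues 1 and 3, so that the optimal omega is 1/2), and the choice p = 3 with four steps are assumptions made for illustration; the inner loop accumulates the geometric sum in Horner form.

```python
def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def refine_inverse(A, R, p, steps):
    """Apply (2a)-(2b): R <- R (I + E + ... + E^(p-1)) with E = I - A R."""
    n = len(A)
    I = identity(n)
    for _ in range(steps):
        AR = matmul(A, R)
        E = [[I[i][j] - AR[i][j] for j in range(n)] for i in range(n)]
        # Horner form: S <- I + E S, repeated p - 1 times starting from S = I.
        S = identity(n)
        for _ in range(p - 1):
            ES = matmul(E, S)
            S = [[I[i][j] + ES[i][j] for j in range(n)] for i in range(n)]
        R = matmul(R, S)
    return R

A = [[2.0, 1.0], [1.0, 2.0]]          # eigenvalues m = 1 and M = 3
omega = 2.0 / (3.0 + 1.0)             # omega* = 2 / (M + m)
R0 = [[omega, 0.0], [0.0, omega]]
R = refine_inverse(A, R0, 3, 4)       # third order method, four iterations
```

Here the initial error matrix has spectral radius (M - m)/(M + m) = 1/2, so after four third order steps the error is of order (1/2)^81, far below rounding level.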


PROBLEMS, SECTION 6 

1. Newton’s method for improving $R_0$, the approximate inverse of A, is 
formally obtained by setting 

$A = (R_0 + \delta R_0)^{-1}$
$= [R_0 (I + R_0^{-1} \delta R_0)]^{-1}$
$= (I + R_0^{-1} \delta R_0)^{-1} R_0^{-1}.$

Therefore, 

$A R_0 = (I + R_0^{-1} \delta R_0)^{-1} \cong I - R_0^{-1} \delta R_0.$

Solve for $\delta R_0$. Does this formula fit into the iteration scheme (2) for p = 2? 

2. Show that if A is non-singular, the choice $R_0 = \alpha A^*$ with $\alpha = 1/\mathrm{tr}(A A^*)$ 
produces a convergent matrix $E_0$ in (1). 




3 


Iterative Solution of 
Non-Linear Equations 


0. INTRODUCTION 

In this chapter, we consider iterative methods for determining the roots 
of equations 

f(x) = 0 

where $\mathbf{f}$ and $\mathbf{x}$ are vectors of the same dimension k: i.e., if k = 1 we have a 
single equation; if k = n we have a system of n equations. Most of the 
iterative methods can be written in the form $\mathbf{x}_{n+1} = \mathbf{g}(\mathbf{x}_n)$ for some suitable 
function $\mathbf{g}$ and initial approximation $\mathbf{x}_0$. The convergence of this iteration 
process is assured if the mapping $\mathbf{g}(\mathbf{x})$ carries a closed and bounded set 
$S \subset C_k$ into itself and if the mapping is contracting, i.e., if 
$\|\mathbf{g}(\mathbf{x}) - \mathbf{g}(\mathbf{y})\| \le M \|\mathbf{x} - \mathbf{y}\|$ for some norm, “Lipschitz” constant M < 1, and all $\mathbf{x}, \mathbf{y} \in S$. 
Such an iteration scheme is sometimes referred to as the Picard iteration 
method, or as a functional iteration method. It can be easily shown under 
these conditions that $\mathbf{g}(\mathbf{x})$ has a unique fixed point $\boldsymbol{\alpha}$ in S satisfying 

$\boldsymbol{\alpha} = \mathbf{g}(\boldsymbol{\alpha}).$

We shall study this contracting mapping theorem in one or more 
dimensions and the related results which are basic to many of the iterative 
methods of this chapter. 

Usually the iterative methods are valid for real and complex roots. 
However, in the latter case complex arithmetic must be incorporated 
into the appropriate digital computer codes and the initial estimate of 
the root must usually be complex (see Subsection 4.4 for an exception). 
The iterative methods require at least one initial estimate or guess at the 
location of the root being sought. If this initial estimate, say $x_0$, is 
“sufficiently close” to a root, then, in general, the procedures will converge. 
The problem of how to obtain such a “good” $x_0$ is unresolved in 
general. Frequently, a good estimate of the root is known to the problem 
formulator (i.e., the engineer, physicist, mathematician, or other scientist 
who is interested in the solution) or can be found by an analytical study. 
For many purposes merely graphical accuracy (about two decimal figures) 
is needed for the initial value. In these cases, one may tabulate the 
function and plot the data in one or two variables or “fit” linear forms 
$a_{i0} + \sum_{j=1}^{k} a_{ij} x_j$ to $f_i(\mathbf{x})$ to obtain the approximate starting values. If a digital 
computer is to be employed, this plotting method is quite convenient since 
all of the required function evaluations will be contained in the eventual 
machine code for the problem. 

As a general empirical rule, the schemes which converge more rapidly 
(i.e., higher order methods) require closer initial estimates. In practice, 
these higher order schemes may require the use of more significant digits 
in order that they converge as theoretically predicted. Thus, it is frequently 
a good idea to use a simple method to start with and then, when fairly 
close to the root, to use some higher order method for just a few iterations. 

For polynomial equations in one variable we know much about the 
roots. While the general iteration schemes apply to them there are also 
special methods which can be used to obtain the zeros of polynomials. 
Such considerations are to be found in Section 4. 


1. FUNCTIONAL ITERATION FOR A SINGLE EQUATION 

Let us consider a scalar equation of the special form 

(1) $x - g(x) = 0$, or $x = g(x)$. 

[It is clear that any equation $f(x) = 0$ can be written equivalently in this 
form by defining $g(x) \equiv x - f(x)$.] If $x_0$ is some initial estimate of a 
root of (1), a scheme naturally suggested is to form the sequence 

(2) $x_{\nu+1} = g(x_\nu), \quad \nu = 0, 1, \ldots.$

An important result concerning the convergence of this procedure and a 
proof of the existence of a unique root is contained in 

THEOREM 1. Let g(x) satisfy the Lipschitz condition 

(3a) $|g(x) - g(x')| \le \lambda |x - x'|,$

for all values x, x' in the closed† interval $I \equiv [x_0 - \rho, x_0 + \rho]$ where the 
Lipschitz constant, $\lambda$, satisfies 

(3b) $0 \le \lambda < 1.$

Let the initial estimate, $x_0$, be such that 

(4) $|x_0 - g(x_0)| \le (1 - \lambda) \rho.$

Then 

(i) all the iterates $x_\nu$, defined by (2), lie within the interval I; i.e., 

(5) $x_0 - \rho \le x_\nu \le x_0 + \rho,$

(ii) (existence) the iterates converge to some point, say $\alpha$, 

$\lim_{\nu \to \infty} x_\nu = \alpha$ (in fact, $|x_\nu - \alpha| \le \lambda^\nu \rho$) 

which is a root of (1), and 

(iii) (uniqueness) $\alpha$ is the only root in $[x_0 - \rho, x_0 + \rho]$. 

† Unless otherwise specified: [a, b] denotes the closed interval, $a \le x \le b$; (a, b) 
denotes the open interval, $a < x < b$; (a, b] and [a, b) denote respectively the 
half-open intervals $a < x \le b$ and $a \le x < b$. 

Proof. We prove (i) by induction. Since $x_1 = g(x_0)$, we have by (3b) 
and (4) 

(6) $|x_0 - x_1| \le (1 - \lambda) \rho \le \rho$

and hence $x_1$ is in the interval (5). Assume this true for the iterates 
$x_1, x_2, \ldots, x_\nu$. Then from (2) 

$|x_{\nu+1} - x_\nu| = |g(x_\nu) - g(x_{\nu-1})|$

and by the inductive assumption $x_\nu$ and $x_{\nu-1}$ are in the interval (5). 
Thus, by (3a), the Lipschitz condition yields 

(7) $|x_{\nu+1} - x_\nu| \le \lambda |x_\nu - x_{\nu-1}|$
    $\le \lambda^2 |x_{\nu-1} - x_{\nu-2}|$
    $\vdots$
    $\le \lambda^\nu |x_1 - x_0|$
    $\le \lambda^\nu (1 - \lambda) \rho.$

Here we have used (2) and (3a) recursively and then applied (6). However, 

$|x_{\nu+1} - x_0| = |(x_{\nu+1} - x_\nu) + (x_\nu - x_{\nu-1}) + \cdots + (x_1 - x_0)|$
$\le |x_{\nu+1} - x_\nu| + |x_\nu - x_{\nu-1}| + \cdots + |x_1 - x_0|$
$\le (\lambda^\nu + \lambda^{\nu-1} + \cdots + 1)(1 - \lambda) \rho = (1 - \lambda^{\nu+1}) \rho$
$\le \rho,$

which completes the proof of (i). 
To prove part (ii), we first show that the sequence $\{x_\nu\}$ is a Cauchy 
sequence. Thus, for arbitrary positive integers m and p, we consider 

(8) $|x_m - x_{m+p}| = |(x_m - x_{m+1}) + (x_{m+1} - x_{m+2}) + \cdots + (x_{m+p-1} - x_{m+p})|$
    $\le |x_m - x_{m+1}| + |x_{m+1} - x_{m+2}| + \cdots + |x_{m+p-1} - x_{m+p}|$
    $\le (\lambda^m + \lambda^{m+1} + \cdots + \lambda^{m+p-1})(1 - \lambda) \rho$
    $= (1 - \lambda^p) \rho \lambda^m.$

Here we have used the inequalities (7) which are valid since (i) has been 
proved. Now given any $\epsilon > 0$, since $\lambda$ in $0 \le \lambda < 1$ is fixed, we can find 
an integer $N(\epsilon)$ such that $|x_m - x_{m+p}| < \epsilon$ for all $m > N(\epsilon)$ and $p > 0$ 
(we need only take N such that $\lambda^N < \epsilon/\rho$). Hence the sequence $\{x_\nu\}$ is a 
Cauchy sequence and has a limit, say $\alpha$, in I. Since the function g(x) is 
continuous in the interval I, the sequence $\{g(x_\nu)\}$ has the limit $g(\alpha)$ and 
by (2) this limit must also be $\alpha$; that is, $\alpha = g(\alpha)$. Now 
$|x_\nu - \alpha| = |g(x_{\nu-1}) - g(\alpha)| \le \lambda |x_{\nu-1} - \alpha|$; hence $|x_\nu - \alpha| \le \lambda^\nu |x_0 - \alpha| \le \rho \lambda^\nu$. 

For part (iii), the uniqueness, let $\beta$ be another root in $[x_0 - \rho, x_0 + \rho]$. 
Then, since $\alpha$ and $\beta$ are both in this interval, (3) holds and we have, if 
$|\alpha - \beta| \ne 0$, 

$|\alpha - \beta| = |g(\alpha) - g(\beta)| \le \lambda |\alpha - \beta| < |\alpha - \beta|.$

This contradiction implies that $\alpha = \beta$ and the proof of the theorem is 
concluded. ■ 
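The hypotheses and the error bound of Theorem 1 can be checked on a concrete function. This is a minimal sketch with illustrative choices that are not from the text: g is the cosine, the initial estimate is 0.75, the interval half-width is 0.25, and the Lipschitz constant is taken as the maximum of |sin x| on the interval [0.5, 1.0]. The function name `fixed_point` is hypothetical.

```python
import math

def fixed_point(g, x0, steps):
    """Return the iterates x_0, x_1, ..., x_steps of x_{v+1} = g(x_v)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(g(xs[-1]))
    return xs

g = math.cos
x0, rho = 0.75, 0.25                     # interval I = [0.5, 1.0]
lam = math.sin(x0 + rho)                 # bound on |g'(x)| = |sin x| over I
hypothesis_4 = abs(x0 - g(x0)) <= (1.0 - lam) * rho

xs = fixed_point(g, x0, 100)
root = xs[-1]                            # iterates have settled to rounding level
in_interval = all(x0 - rho <= x <= x0 + rho for x in xs)
bound_holds = all(abs(x - root) <= lam ** v * rho + 1e-15
                  for v, x in enumerate(xs))
```

The bound in part (ii) is pessimistic here, because the constant used over the whole interval is larger than the slope of g near the root.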

COROLLARY. If $|g'(x)| \le \lambda < 1$ for $|x - x_0| \le \rho$ and (4) is satisfied, then 
the conclusion of Theorem 1 is valid. 

Figure 1 

Proof. The mean value theorem implies $g(x_1) - g(x_2) = g'(\xi)(x_1 - x_2)$. 
Whence $\lambda$ may serve as the Lipschitz constant in (3a and b). ■ 

A geometric interpretation of Theorem 1 is suggested by Figure 1. This 
illustrates the case with $g(x_0) - x_0 \equiv \eta > 0$ and the triangles I and II, 
determined by lines with slope $\pm\lambda$ through $(x_0, g(x_0))$, are the regions in which 
the values of g(x) lie for $x_0 - \rho \le x \le x_0 + \rho$. It is easy to verify that 

(a) if $\lambda \ge 1$, the line y = x will not intersect the upper boundary of 
triangle I or if 

(b) $\lambda < 1$ and $\eta > (1 - \lambda) \rho$ that the line y = x will not intersect the 
upper boundary of triangle I and hence may not intersect an 
admissible function g(x) in the interval $[x_0 - \rho, x_0 + \rho]$. 

In other words, the conditions $\lambda < 1$, $|\eta| \le (1 - \lambda) \rho$ are necessary to 
insure the existence of a root for every function g(x) satisfying conditions 
(3a and b). 
Figures 2a and 2b illustrate convergent iterative sequences for functions 
g(x) with positive and negative slope, respectively. Note that the sequence 
$\{x_n\}$ converges to $\alpha$ monotonically for g(x) of positive slope and converges 
with values alternately above and below $\alpha$ for g(x) of negative slope. 

Another convergence theorem is 


THEOREM 2. If $x = g(x)$ has a root at $x = \alpha$ and in the interval 

(9a) $|x - \alpha| < \rho$

g(x) satisfies 

(9b) $|g(x) - g(\alpha)| \le \lambda |x - \alpha|,$

with $\lambda < 1$, then for any $x_0$ in (9a): 

(i) all the iterates $x_\nu$ of (2) lie in the interval (9a), 

(ii) the iterates $x_\nu$ converge to $\alpha$, 

(iii) the root $\alpha$ is unique in this interval. 

Proof. Part (i) is again proved by induction. By hypothesis $x_0$ is in 
(9a) and we assume $x_1, x_2, \ldots, x_{\nu-1}$ are also. Then since $\alpha = g(\alpha)$ we have 
from (2) 

(10a) $|\alpha - x_\nu| = |g(\alpha) - g(x_{\nu-1})| \le \lambda |\alpha - x_{\nu-1}|,$

whence $\lambda < 1$ implies (i). Furthermore, 

(10b) $|\alpha - x_\nu| \le \lambda |\alpha - x_{\nu-1}|$
      $\le \lambda^2 |\alpha - x_{\nu-2}|$
      $\vdots$
      $\le \lambda^\nu |\alpha - x_0|.$

By letting $\nu \to \infty$, we see that $x_\nu \to \alpha$, since $\lambda < 1$. The uniqueness follows 
as in Theorem 1. ■ 

Notice that condition (9b) is weaker than the general Lipschitz condition 
for the interval (9a), since the one point a is fixed. This feature is applicable 
in Problem (1). 

We can now prove a corollary (with a hypothesis which is oftentimes 
more readily verifiable). 

COROLLARY. If we replace (9b) by 

(9b)' $|g'(x)| \le \lambda < 1,$

then the conclusions (i), (ii), and (iii) follow. 

Proof. From the mean value theorem and (2) 

(10a)' $\alpha - x_\nu = g(\alpha) - g(x_{\nu-1}) = g'(\xi_{\nu-1})(\alpha - x_{\nu-1}).$

Hence (10a) follows from (9b)'. Therefore (10b) and the rest of the proof 
of Theorem 2 apply. ■ 

It is clear from (10a)' that, if the iterations converge, $\xi_\nu \to \alpha$ and thus 
“asymptotically” (as $\nu \to \infty$) 

(11) $|\alpha - x_{k+\nu}| \approx |g'(\alpha)|^\nu |\alpha - x_k|,$

for large enough k. The quantity 

(12) $\lambda \equiv |g'(\alpha)|$

is frequently called the (asymptotic) convergence factor and in analogy 
with the iterative solution of linear systems 

(13) $R \equiv \log \frac{1}{\lambda}$

may be called the rate of convergence (if $\lambda < 1$). The number of additional 
iterations required to reduce the error at the kth step by the factor $10^{-m}$ 
is then, asymptotically, 

(14) $\nu = \frac{m}{R}.$

We assume, in these definitions, that $\lambda = |g'(\alpha)| \ne 0$ and define such an 
iteration scheme (2) to be a first order method; higher order methods are 
considered in Subsection 1.2. 
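The estimate (14) can be compared with an actual iteration count. The sketch below assumes the logarithm in (13) is taken to base 10, which is what the factor $10^{-m}$ in (14) suggests; the function name `iterations_for_digits`, the choice of the cosine for g, and the starting point 0.74 are illustrative assumptions.

```python
import math

def iterations_for_digits(conv_factor, m):
    """Estimate (14): v = m / R with R = log10(1 / conv_factor)."""
    R = math.log10(1.0 / conv_factor)
    return m / R

g = math.cos
root = 0.74
for _ in range(200):                 # iterate until the root is at rounding level
    root = g(root)

lam = abs(math.sin(root))            # convergence factor |g'(root)| for g = cos
estimate = iterations_for_digits(lam, 6)

# Count the steps actually needed to reduce the starting error by 10**(-6).
x = 0.74
start_err = abs(x - root)
actual = 0
while abs(x - root) > 1e-6 * start_err:
    x = g(x)
    actual += 1
```

Since the starting point is already close to the root, the asymptotic estimate and the observed count should agree to within a step or two.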


1.1. Error Propagation 

In actual computations it may not be possible, or practical, to evaluate 
the function g(x) exactly (i.e., only a finite number of decimals may be 
retained after rounding or g(x) may be given as the numerical solution 
of a differential equation, etc.). For any value x we may then represent 
our approximation to g(x) by $G(x) \equiv g(x) + \delta(x)$ where $\delta(x)$ is the error 
committed in evaluating g(x). Frequently we may know a bound for 
$\delta(x)$, i.e., $|\delta(x)| \le \delta$. Thus the actual iteration scheme which is used may 
be represented as 

(15) $X_{\nu+1} = g(X_\nu) + \delta_\nu, \quad \nu = 0, 1, 2, \ldots,$

where the $X_\nu$ are the numbers obtained from the calculations and the 
$\delta_\nu \equiv \delta(X_\nu)$ satisfy 

(16) $|\delta_\nu| \le \delta, \quad \nu = 0, 1, \ldots.$
We cannot expect the computed iterates $X_\nu$ of (15) to converge. However, 
under proper conditions, it should be possible to approximate a root to 
an accuracy determined essentially by the accuracy of the computations, $\delta$. 

For example, from Figure 3 it is easy to see that for the special case 
of $g(x) \equiv \alpha + \lambda (x - \alpha)$, the uncertainty in the root $\alpha$ is bounded by 
$\pm\delta/(1 - \lambda)$. We note that, if the slope $\lambda$ is close to unity the problem is 
not “properly posed.” We now establish Theorem 3 which states quite 
generally that when the functional iteration scheme is convergent, the 
presence of errors in computing g(x), of magnitudes bounded by $\delta$, 
causes the scheme to estimate the root $\alpha$ with an uncertainty bounded by 
$\pm\delta/(1 - \lambda)$. 

theorem 3. Let x = g(x) satisfy the conditions of Theorem 2. Let X 0 be 
any point in the interval 

(17a) |« — jc| < p 0 , 

where 

(17b) 0 < Po < p - yfj- 

Then the iterates X v of (15), with the errors bounded by (16), lie in the 
interval 

|« - X,\ < p, 

and 

(18) |« - X k \ < + \f 0 - 

where \ k —> 0 as k -» oo. 



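Figure 3

Proof. It is clear that $|\alpha - X_0| \le \rho_0 < \rho_0 + \delta/(1 - \lambda) \le \rho$. Then for
an inductive proof assume $X_0, X_1, \ldots, X_{\nu-1}$ are in $|\alpha - x| \le \rho$. By (15)
and (16)

$$|\alpha - X_\nu| = \bigl| [g(\alpha) - g(X_{\nu-1})] - \delta_{\nu-1} \bigr| \le |g(\alpha) - g(X_{\nu-1})| + \delta.$$

From (9b), we then have

$$\begin{aligned}
|\alpha - X_\nu| &\le \lambda |\alpha - X_{\nu-1}| + \delta \\
&\le \lambda^2 |\alpha - X_{\nu-2}| + \lambda\delta + \delta \\
&\le \lambda^3 |\alpha - X_{\nu-3}| + \lambda^2\delta + \lambda\delta + \delta \\
&\;\;\vdots \\
&\le \lambda^\nu |\alpha - X_0| + \lambda^{\nu-1}\delta + \cdots + \lambda\delta + \delta \\
&\le \lambda^\nu \rho_0 + \frac{1 - \lambda^\nu}{1 - \lambda}\,\delta \\
&= \lambda^\nu \rho_0 + \frac{\delta}{1 - \lambda} - \frac{\lambda^\nu \delta}{1 - \lambda} \\
&\le \rho_0 + \frac{\delta}{1 - \lambda}.
\end{aligned}$$

Thus all iterates lie in $|\alpha - x| \le \rho$ and the iteration process is defined.
Moreover, from the last inequality involving $\nu$ we find the estimate (18)
which completes the proof. ■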

Theorem 3 shows that the method is “as convergent as possible,”
that is, the computational errors which arise from the evaluations of $g(x)$
may cumulatively produce an error of magnitude at most $\delta/(1 - \lambda)$. It
is also clear that such errors limit the size of the error bound independently
of the number of iterations. Thus in actual calculations, it is not worthwhile
to iterate until $\lambda^\nu \rho_0 \ll \delta/(1 - \lambda)$. In fact, if reasonable estimates of
$\lambda$, $\delta$, and $\rho_0$ are known, it is an efficient procedure to have the two types of
error term in (18) of the same magnitude; i.e.,

$$\lambda^\nu \rho_0 \doteq \frac{\delta}{1 - \lambda}.\dagger$$

The number of iterations required is then about

$$\nu = \log\left[ \frac{\delta}{(1 - \lambda)\rho_0} \right] \Big/ \log \lambda.$$

† $\doteq$ reads “approximately equals” and we have tacitly assumed that $\delta/(1 - \lambda) \ll \rho_0$.

Of course, if the acceptable error is much greater than $\delta/(1 - \lambda)$ the
number of iterations given by (13) and (14) is the relevant estimate. It is
essential to estimate $\delta$ from the arithmetic calculations involved in evaluating
$g(x)$. Also, note that since the $X_\nu$ need not converge, any tests for the
termination of the iterations should allow for this roundoff effect.
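
The bound (18) and the balanced iteration count can be tried out numerically. The following is a minimal sketch, not part of the original text: it assumes the linear special case $g(x) = \alpha + \lambda(x - \alpha)$ mentioned above, models the evaluation errors by uniformly distributed random numbers of magnitude at most $\delta$, and uses illustrative values of $\alpha$, $\lambda$, $\delta$, and $\rho_0$. The function names are hypothetical.

```python
import math
import random

def perturbed_iteration(g, x0, delta, steps, seed=0):
    """Return X_0, ..., X_steps for X_{v+1} = g(X_v) + d_v with |d_v| <= delta."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    xs = [x0]
    for _ in range(steps):
        xs.append(g(xs[-1]) + rng.uniform(-delta, delta))
    return xs

def balanced_iteration_count(lam, delta, rho0):
    """The nu for which lam**nu * rho0 equals delta / (1 - lam)."""
    return math.log(delta / ((1.0 - lam) * rho0)) / math.log(lam)

# Illustrative (assumed) data: root alpha, slope lam, error bound delta.
alpha, lam, delta, rho0 = 1.0, 0.5, 1e-6, 0.25
g = lambda x: alpha + lam * (x - alpha)

xs = perturbed_iteration(g, alpha + rho0, delta, 60)
floor = delta / (1.0 - lam)  # uncertainty that no amount of iterating removes
bounds = [floor + lam ** k * (rho0 - floor) for k in range(len(xs))]  # (18)
nu = balanced_iteration_count(lam, delta, rho0)

print(f"error floor delta/(1 - lam) = {floor:.3e}")
print(f"balanced iteration count    = {nu:.1f}")
print(f"error after 60 iterations   = {abs(xs[-1] - alpha):.3e}")
```

In such a run the error stops decreasing once it reaches the order of $\delta/(1 - \lambda)$, so iterating far beyond the balanced count gains nothing.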

1.2. Second and Higher Order Iteration Methods 

It is clear from the corollary to Theorem 2, that if at the root $x = \alpha$,

(20) $$g'(\alpha) = 0,$$

then the convergence should be quite rapid. Let this be the case and
assume further that $g''(x)$ exists and is bounded in some interval,
$|\alpha - x| \le \rho$, in which (9) is satisfied. Then for any $x$ in this interval we
have by Taylor's theorem:

$$g(x) = g(\alpha) + 0 + \frac{(x - \alpha)^2}{2}\, g''(\xi).$$

Here $\xi$ is some value between $x$ and $\alpha$. By using this result we obtain
for any iterate (2) (assuming $|\alpha - x_0| \le \rho$ and $\tfrac12 \rho\,|g''(x)| \le \lambda < 1$):

(21) $$|x_\nu - \alpha| = |g(x_{\nu-1}) - g(\alpha)| = \left|\tfrac12 g''(\xi_{\nu-1})\right| \cdot |x_{\nu-1} - \alpha|^2, \qquad \nu = 1, 2, 3, \ldots.$$

Thus, the error in any iterate is proportional to the square of the previous
error and if $g''(\alpha) \ne 0$ the procedure (2) is now called a second order method.
Let the bound on $g''(x)$ be denoted by

(22) $$|g''(x)| \le 2M, \qquad |\alpha - x| \le \rho.$$

Then from (21)

(23) $$\begin{aligned}
|x_\nu - \alpha| &\le M\,|x_{\nu-1} - \alpha|^2 \\
&\le M \cdot M^2\,|x_{\nu-2} - \alpha|^4 \\
&\le M \cdot M^2 \cdot M^4\,|x_{\nu-3} - \alpha|^8 \\
&\;\;\vdots \\
&\le M^{(2^\nu - 1)}\,|x_0 - \alpha|^{2^\nu} \\
&\le (M\,|x_0 - \alpha|)^{2^\nu - 1}\,|x_0 - \alpha|.
\end{aligned}$$

Thus, if $M|x_0 - \alpha| < 1$, the second order method converges and reduces
the initial error by at least $10^{-m}$ when

$$(M\,|x_0 - \alpha|)^{2^\nu - 1} \le 10^{-m}.$$

The number of iterations required is now obtained from

$$2^\nu - 1 \ge \frac{-m}{\log (M\,|x_0 - \alpha|)},$$

or

(24) $$\nu \ge \frac{1}{\log 2}\, \log\left[ 1 + \frac{m}{\log \bigl( 1/(M\,|x_0 - \alpha|) \bigr)} \right].$$


A comparison with first order schemes is possible, i.e., the estimates in
(12)-(14), if we assume $\lambda = M|x_0 - \alpha|$. That is, by letting $\nu^{(1)}$ and $\nu^{(2)}$
represent the exponents $\nu$ in (10b) and (23), respectively, we have equal
reduction of the error if

(25) $$2^{\nu^{(2)}} = 1 + \nu^{(1)}.$$

For instance, 130 iterations of the first order scheme are equivalent, under
the above assumptions, to about 7 iterations of a second order scheme!
A further striking property of second order schemes can be obtained by
assuming for all $\nu \ge \nu_0$ that

$$|x_\nu - \alpha| = 10^{-p_\nu}, \qquad p_\nu > 0;$$

i.e., $p_\nu$ is essentially the number of correct decimals in the $\nu$th iterate.
Then from the first line of (23):

$$10^{-p_\nu} \le M\, 10^{-2p_{\nu-1}},$$

and upon taking the logarithm of both sides

(26) $$p_\nu \ge 2p_{\nu-1} - \log M.$$

Thus, if $M < 1$, then $-\log M > 0$ and the number of correct decimals
more than doubles on each iteration. (If $M > 1$ the number does not quite
double but, since $p_\nu \gg \log M$ for large $\nu$, this doubling is at least
asymptotically true.)
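
The doubling of correct decimals in (26) and the comparison (25) are easy to observe. The sketch below is an illustration only; it assumes the particular map $g(x) = \tfrac12(x + 2/x)$, which satisfies $g(\alpha) = \alpha$ and $g'(\alpha) = 0$ at $\alpha = \sqrt{2}$, and the helper names are hypothetical.

```python
import math

def iterate(g, x0, steps):
    """Return x_0, ..., x_steps of the functional iteration x_{v+1} = g(x_v)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(g(xs[-1]))
    return xs

# Assumed example: a second order map with root alpha = sqrt(2).
alpha = math.sqrt(2.0)
g = lambda x: 0.5 * (x + 2.0 / x)

xs = iterate(g, 1.5, 3)
# p_v = -log10 |x_v - alpha| is "the number of correct decimals" of x_v.
digits = [-math.log10(abs(x - alpha)) for x in xs]
for v, p in enumerate(digits):
    print(f"v = {v}: about {p:.1f} correct decimals")

def equivalent_second_order_steps(nu1):
    """Solve 2**nu2 = 1 + nu1, i.e. equation (25), for nu2."""
    return math.log2(1 + nu1)

print(f"130 first order steps ~ {equivalent_second_order_steps(130):.2f} second order steps")
```

Only three steps are taken here because a fourth would already reach the limit of double precision, where the observed count of decimals no longer reflects (26).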

Schemes which are more quickly convergent than second order ones are
now easily described. Let us assume that at a root $x = \alpha$ of (1):

(27a) $$g'(\alpha) = g''(\alpha) = \cdots = g^{(n-1)}(\alpha) = 0;$$

(27b) $$g^{(n)}(\alpha) \ne 0; \qquad |g^{(n)}(x)| \le n!\,M \quad \text{in } |x - \alpha| \le \rho.$$

Then by Taylor's theorem,

$$g(x) = \alpha + \frac{(x - \alpha)^n}{n!}\, g^{(n)}(\xi), \qquad |x - \alpha| \le \rho,$$

where $\xi$ is between $x$ and $\alpha$. Again from (2) and the above

(28) $$|x_{\nu+1} - \alpha| = \frac{|g^{(n)}(\xi_\nu)|}{n!}\, |x_\nu - \alpha|^n \le M\,|x_\nu - \alpha|^n.$$

The method (2) under conditions (27) is now called an $n$th order procedure
and one can easily deduce the results corresponding to (23)-(26) for such
methods. In the event that $g(x)$ is calculated with an error of magnitude $\delta$
as in (15), the root $\alpha$ may be determined only to within an uncertainty
of at best $\pm\delta$. This conclusion follows by letting $\lambda \to 0$ in (18).


PROBLEMS, SECTION 1 

1. Given $g(x) \equiv x^2 - 2x + 2$. For what values $x_0$ does (2) converge?
[Hint: Use Theorem 2.] What is the order of the convergence? Sketch a
graph analogous to Figures 2a and b.

2. For $g(x) \equiv \cos x$, show that $x_{n+1} = g(x_n)$ defines a convergent sequence
for arbitrary $x_0$. Calculate the root $\alpha = \cos \alpha$ to three decimal places.


2. SOME EXPLICIT ITERATION PROCEDURES 

The general problem to which the previous iteration methods are to
be applied is that of finding the root (or roots) of

(1) $$f(x) = 0$$

in some interval, say $a \le x \le b$. Let $\phi(x)$ be any function such that

(2) $$0 < |\phi(x)| < \infty, \qquad a \le x \le b.$$

Then the equation

(3) $$x = g(x) \equiv x - \phi(x) f(x),$$

has roots which coincide with those of (1) in the interval $[a, b]$ and no
others. Many of the standard iterative methods are obtained for special
choices of $\phi(x)$.

Another procedure for defining the function $g(x)$ is to use

(4) $$g(x) \equiv x - F(f(x)),$$

where $F(y)$ is a function such that

$$F(0) = 0; \qquad F(y) \ne 0, \quad y \ne 0.$$

Such methods more naturally describe many higher order schemes.

2.1. The Simple Iteration or Chord Method (First Order)

The simplest choice for $\phi(x)$ in (3) is to take

(5) $$\phi(x) \equiv m \ne 0.$$

If $f(x)$ is differentiable, we note that

(6) $$g'(x) = 1 - m f'(x),$$

and the scheme will be convergent, by the corollary to Theorem 1.2, in
some interval about $\alpha$ provided that $m$ is chosen such that

(7) $$0 < m f'(\alpha) < 2.$$

Thus $m$ must have the same sign as $f'(\alpha)$, while if $f'(\alpha) = 0$, (7) cannot be
satisfied.

The choice (5) yields the iteration equations

$$x_{\nu+1} = x_\nu - m f(x_\nu).$$

These iterates have a geometric realization in which the value $x_{\nu+1}$ is the
$x$ intercept of the line with slope $1/m$ through $(x_\nu, f(x_\nu))$. (See Figure 1.)
The inequality (7) implies that this slope should be between $\infty$ (i.e.,
vertical) and $\tfrac12 f'(\alpha)$ [i.e., half the slope of the tangent to the curve $y = f(x)$
at the root]. It is from this geometric description that the name chord
method is derived: the next iterate is determined by a chord of constant
slope joining a point on the curve to the $x$-axis.
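
The chord iteration can be sketched in a few lines. This is an illustration under stated assumptions rather than a definitive implementation: the test function $f(x) = x^2 - 2$, the constant $m = 0.3$ (for which $m f'(\alpha) \approx 0.85$ satisfies (7)), the stopping rule, and the function name are all choices made here.

```python
def chord_method(f, m, x0, tol=1e-12, max_iter=200):
    """Iterate x_{v+1} = x_v - m*f(x_v) until two iterates agree to within tol.

    Returns the last iterate and the number of iterations used.
    """
    x = x0
    for k in range(1, max_iter + 1):
        x_new = x - m * f(x)  # x intercept of the line of slope 1/m
        if abs(x_new - x) <= tol:
            return x_new, k
        x = x_new
    raise RuntimeError("chord iteration did not converge")

# Assumed example: f(x) = x^2 - 2 with root sqrt(2), f'(root) = 2*sqrt(2).
f = lambda x: x * x - 2.0
root, steps = chord_method(f, m=0.3, x0=1.0)
print(root, steps)
```

Since $g'(\alpha) = 1 - m f'(\alpha) \approx 0.15$ for this choice, the error shrinks by roughly that factor per step; a value of $m$ with the wrong sign would violate (7) and the iterates would move away from the root.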

2.2. Newton’s Method (Second Order) 

If the slope of the chord is changed at each iteration so that

(8) $$g'(x_\nu) = 1 - m_\nu f'(x_\nu) = 0,$$

then a second order procedure may be obtained. From (8) we find

(9) $$m_\nu = \frac{1}{f'(x_\nu)}$$

which suggests the choice in (3) of

(10) $$\phi(x) \equiv \frac{1}{f'(x)} \quad \text{or} \quad g(x) \equiv x - \frac{f(x)}{f'(x)}.$$

The resulting iteration procedure is now

(11) $$x_{\nu+1} = x_\nu - \frac{f(x_\nu)}{f'(x_\nu)}$$

and it is at least of second order if $f'(\alpha) \ne 0$ and $f''(x)$ exists, since

(12) $$g'(\alpha) = \frac{f(\alpha) f''(\alpha)}{[f'(\alpha)]^2} = 0.$$

Figure 1

The geometrical interpretation of the scheme (11) simply replaces the
chord in Figure 1 by the tangent line to $y = f(x)$ at $(x_\nu, f(x_\nu))$.
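
A minimal sketch of the iteration (11) follows. It is an illustration only: the cubic $f(x) = x^3 - 2x - 5$, the starting value, the relative stopping test, and the function name are assumptions made here and are not taken from the text.

```python
def newton(f, fprime, x0, tol=1e-14, max_iter=50):
    """Newton's method (11): x_{v+1} = x_v - f(x_v)/f'(x_v).

    Returns the final iterate and the list of all iterates.
    """
    x = x0
    history = [x]
    for _ in range(max_iter):
        slope = fprime(x)
        if slope == 0.0:
            # (11) is undefined where the tangent is horizontal.
            raise ZeroDivisionError("f'(x) vanished at an iterate")
        x_new = x - f(x) / slope
        history.append(x_new)
        if abs(x_new - x) <= tol * max(1.0, abs(x_new)):
            return x_new, history
        x = x_new
    raise RuntimeError("Newton iteration did not converge")

# Assumed example: a cubic with a simple real root near 2.09.
f = lambda x: x ** 3 - 2.0 * x - 5.0
fprime = lambda x: 3.0 * x ** 2 - 2.0
root, history = newton(f, fprime, 2.0)
for v, x in enumerate(history):
    print(v, x)
```

Printing the iterates shows the number of agreeing decimals roughly doubling from one line to the next, as the second order estimate (1.26) suggests.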

In applying Newton's method, we are required to evaluate $f'(x_\nu)$ as
well as $f(x_\nu)$ at each step of the procedure. For sufficiently simple functions,
which are given explicitly, this may offer no serious difficulty. (This is
especially true for polynomials whose derivatives are easily evaluated by
synthetic division; see Subsection 4.1.) However, if $f(x)$ is known only
implicitly (say as the solution of some differential equation in which $x$
is a parameter in the initial data), it may be impractical to evaluate $f'(x_\nu)$
at each iteration. In such cases the derivative may be approximated by
various methods, the most obvious approximation being

(13) $$f'(x_\nu) \doteq \frac{f(x_\nu) - f(x_{\nu-1})}{x_\nu - x_{\nu-1}}.$$

If this approximation is used, the procedure is no longer Newton's method
but is the method of false position discussed in the next subsection.

A useful observation on the application of Newton's method, or the
false position variation of it, is based on the fact that as the iterations
converge, $f'(x_\nu)$ or its approximations converge to $f'(\alpha)$. Thus, for all
iterates $\nu \ge \nu_0$, say, it may suffice to use $f'(x_{\nu_0})$ in place of $f'(x_\nu)$ in (11).
The iteration method from this point on is then just the chord method
with $1/m = f'(x_{\nu_0})$.

It should be noted that Newton's method may be undefined and condition
(2) violated if $f'(x) = 0$ for some $x$ in $[a, b]$. In particular, if at the
root $x = \alpha$, $f'(\alpha) = 0$, the procedure may no longer be of second order
since (12) is not satisfied. To examine this case we assume that

(14a) $$f(x) = (x - \alpha)^p h(x), \qquad p > 1$$

where the function $h(x)$ has a second derivative and

(14b) $$h(\alpha) \ne 0.$$

From (14) in (10) we find that

$$g'(x) = \frac{\left(1 - \dfrac{1}{p}\right) + (x - \alpha)\,\dfrac{2h'(x)}{p\,h(x)} + (x - \alpha)^2\,\dfrac{h''(x)}{p^2 h(x)}}{\left[1 + (x - \alpha)\,\dfrac{h'(x)}{p\,h(x)}\right]^2}.$$

Thus for $x_0$ sufficiently close to $\alpha$ we have $|g'(x)| < 1$ for $x \in [x_0, \alpha]$ and
the iterations (11) will converge. The asymptotic convergence factor is
now

$$|g'(\alpha)| = 1 - \frac{1}{p}.$$

So only in the case of a linear root, i.e., $p = 1$, is Newton's method second
order, but it will converge as a first order method in the general case (14).
If the order of the root, $p$, is known (or can be closely estimated) quadratic
convergence can be retained or approximated by the modification

$$g(x) = x - p\,\frac{f(x)}{f'(x)}.$$

The details of this procedure are left to the reader. A convergence proof
for Newton's method which does not require $f'(\alpha) \ne 0$ is contained in
Theorem 3.3 [see also Problems (3)-(6) of Section 3].
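
The convergence factor $1 - 1/p$ and the effect of the modification can be observed on a small example. The sketch below is an illustration under assumptions made here: the function $f(x) = (x - 1)^2 (x + 2)$, which has a double root ($p = 2$) at $\alpha = 1$ with $h(x) = x + 2$, and the hypothetical helper `newton_steps`.

```python
def newton_steps(f, fprime, x0, steps, p=1):
    """Take a fixed number of steps of x_{v+1} = x_v - p*f(x_v)/f'(x_v).

    p = 1 is Newton's method (11); p > 1 is the modification for a root
    of known order p.
    """
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - p * f(x) / fprime(x))
    return xs

# Assumed example with a double root at alpha = 1.
alpha = 1.0
f = lambda x: (x - 1.0) ** 2 * (x + 2.0)
fprime = lambda x: 2.0 * (x - 1.0) * (x + 2.0) + (x - 1.0) ** 2

plain = newton_steps(f, fprime, 1.5, 12)
# Successive error ratios; they should settle near 1 - 1/p = 1/2.
ratios = [abs(plain[k + 1] - alpha) / abs(plain[k] - alpha) for k in range(12)]
print("unmodified error ratios:", [round(r, 4) for r in ratios[-3:]])

modified = newton_steps(f, fprime, 1.5, 4, p=2)
print("modified errors:", [abs(x - alpha) for x in modified])
```

Only four modified steps are taken, because once an iterate coincides with the root to machine precision both $f$ and $f'$ vanish and the quotient is no longer defined.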


2.3. Method of False Position (Fractional Order) 

If the difference quotient approximation to the derivative, given by (13),
is employed in (11) we obtain the iterative procedure:

(15) $$x_{\nu+1} = x_\nu - f(x_\nu)\,\frac{x_\nu - x_{\nu-1}}{f(x_\nu) - f(x_{\nu-1})}; \qquad \nu = 1, 2, \ldots.$$

It should be noted that two successive iterates, $x_0$ and $x_1$, must be estimated
before the recursion formula can be used. However, only one function
evaluation, $f(x_\nu)$, is required at each step since the previous value,
$f(x_{\nu-1})$, may be retained. [This is an advantage over Newton's method
where two evaluations, $f(x_\nu)$ and $f'(x_\nu)$, are required.] The order of this
procedure cannot be deduced by the analysis of Section 1 since (15)
cannot be written in the scalar form $x_{\nu+1} = g(x_\nu)$. To examine this question
let $x = \alpha$ be a root of $f(x) = 0$. Then we may write, by subtracting each
side of (15) from $\alpha$,

(16) $$\begin{aligned}
\alpha - x_{\nu+1} &= (\alpha - x_\nu) + f(x_\nu)\,\frac{x_\nu - x_{\nu-1}}{f(x_\nu) - f(x_{\nu-1})} \\
&= (\alpha - x_\nu)\,\frac{f[x_{\nu-1}, x_\nu] - f[x_\nu, \alpha]}{f[x_{\nu-1}, x_\nu]},
\end{aligned}$$

where we define $f[a, b] \equiv [f(b) - f(a)]/(b - a)$. This can be further
simplified to the form

$$\alpha - x_{\nu+1} = -(\alpha - x_\nu)(\alpha - x_{\nu-1})\,\frac{f[x_{\nu-1}, x_\nu, \alpha]}{f[x_{\nu-1}, x_\nu]}$$

by introducing

$$f[x_{\nu-1}, x_\nu, \alpha] \equiv \frac{f[x_{\nu-1}, x_\nu] - f[x_\nu, \alpha]}{x_{\nu-1} - \alpha}.$$

Here we have anticipated the divided difference notation to be studied in
Chapter 6 and in Problems 2 and 3 of this section. If the function $f(x)$
has a continuous second derivative in an interval including the points
$x_\nu$, $x_{\nu-1}$, and $\alpha$, then it is shown in Theorem 1.1 of Chapter 6 that

(17) $$f[x_{\nu-1}, x_\nu] = f'(\xi_\nu), \qquad f[x_{\nu-1}, x_\nu, \alpha] = \tfrac12 f''(\eta_\nu),$$

for some points $\xi_\nu$ and $\eta_\nu$ in the obvious intervals. (See also Problem 3.)
Thus we deduce that

(18) $$\alpha - x_{\nu+1} = -\frac{f''(\eta_\nu)}{2f'(\xi_\nu)}\,(\alpha - x_\nu)(\alpha - x_{\nu-1}); \qquad \nu = 1, 2, \ldots.$$

Let us assume that all the iterates are confined to some interval about the
root $\alpha$ and that for all $\xi$, $\eta$ in this interval

(19) $$\left| \frac{f''(\eta)}{2f'(\xi)} \right| \le M.$$

Then by setting $M|\alpha - x_\nu| = e_\nu$ we obtain the inequalities

(20) $$e_{\nu+1} \le e_\nu e_{\nu-1}, \qquad \nu = 1, 2, \ldots,$$

upon multiplication of (18) with $M$ and the use of (19). If we define
$\max(e_0, e_1) = \delta$ the inequalities (20) imply

$$e_2 \le \delta^2, \quad e_3 \le \delta^3, \quad e_4 \le \delta^5, \quad \ldots, \quad e_\nu \le \delta^{m_\nu},$$

where $m_0 = m_1 = 1$ and $m_{\nu+1} = m_\nu + m_{\nu-1}$, $\nu = 1, 2, \ldots$. The numbers
$m_\nu$ form what is known as a Fibonacci sequence. It may be shown (see
Problem 1) that

(21) $$m_\nu = \frac{1}{\sqrt5}\left( r_+^{\nu+1} - r_-^{\nu+1} \right), \qquad r_\pm = \frac{1 \pm \sqrt5}{2}.$$

Thus for large $\nu$:

$$m_\nu \approx \frac{1}{\sqrt5}\,(r_+)^{\nu+1} \doteq 0.447\,(1.618)^{\nu+1}.$$

If $\delta < 1$, then the initial error is reduced by $10^{-m}$ when $\delta^{(m_\nu - 1)} \approx 10^{-m}$
and we may compare this number, $\nu$, of iterations with the corresponding
numbers $\nu^{(2)}$ for the second order method and $\nu^{(1)}$ for the first order method
(see equation 1.25) in the case $\delta = M|x_0 - \alpha|$ by noting that

(22) $$m_\nu - 1 = 2^{\nu^{(2)}} - 1 = \nu^{(1)},$$

or

$$\frac{1}{\sqrt5}\,(r_+)^{\nu+1} \approx 2^{\nu^{(2)}}.$$

Hence

(23) $$\nu \approx c + d\,\nu^{(2)}$$

where

$$c = \frac{\log \sqrt5}{\log r_+} - 1 \doteq 0.672 \quad \text{and} \quad d = \frac{\log 2}{\log r_+} \doteq 1.440.$$

We see that somewhat more of the current iterations are needed for a
given accuracy than is the case for the second order methods (but it
should be recalled that only one function evaluation per iteration is
used).

If we were to postulate that as $\nu \to \infty$: $|\alpha - x_{\nu+1}| \doteq K|\alpha - x_\nu|^r$, then
(18), with the coefficient $|f''/(2f')| = M$, would yield $K = M^{1/r}$, $r = r_+$.
In other words, we might say that the false position method is of order
$\doteq 1.618$. Hence, two steps of Regula Falsi have an order $\doteq (1.6)^2 > 2.5$
and require only two evaluations of $f(x)$.
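
The fractional order can be estimated from computed iterates of (15). The sketch below is an illustration only: the test function $f(x) = x^2 - 2$, the starting values $x_0 = 1$, $x_1 = 2$, and the helper name are assumptions made here. The order is estimated from three successive errors by $\log(e_{\nu+1}/e_\nu)/\log(e_\nu/e_{\nu-1})$, which tends to $r$ if $e_{\nu+1} \doteq K e_\nu^{\,r}$.

```python
import math

def false_position_iterates(f, x0, x1, steps):
    """Return x_0, x_1, ... generated by (15), reusing the previous f value."""
    xs = [x0, x1]
    f_prev, f_curr = f(x0), f(x1)
    for _ in range(steps):
        denom = f_curr - f_prev
        if denom == 0.0:  # difference quotient undefined; stop early
            break
        x_next = xs[-1] - f_curr * (xs[-1] - xs[-2]) / denom
        xs.append(x_next)
        f_prev, f_curr = f_curr, f(x_next)  # one new evaluation per step
    return xs

alpha = math.sqrt(2.0)
xs = false_position_iterates(lambda x: x * x - 2.0, 1.0, 2.0, 5)
errs = [abs(x - alpha) for x in xs]

# Order estimates from consecutive error triples.
orders = [
    math.log(errs[k + 1] / errs[k]) / math.log(errs[k] / errs[k - 1])
    for k in range(1, len(errs) - 1)
]
print("errors:", errs)
print("order estimates:", orders)
```

The early estimates are erratic, and the later ones oscillate about $r_+ \doteq 1.618$ rather than approaching it monotonically; more steps cannot be used in double precision because the error reaches roundoff level.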

A geometric interpretation of the scheme (15) is easily given as follows:
in the $x, f(x)$ plane let the line through $(x_\nu, f(x_\nu))$ and $(x_{\nu-1}, f(x_{\nu-1}))$
intersect the $x$-axis at a point called $x_{\nu+1}$. In other words $f(x)$ is approximated
by a linear function through the indicated pair of points and the
zero of this linear function is taken as the next approximation to the desired
root. Depending upon the location of the points in question this procedure
may be an interpolation or an extrapolation at each iteration.

In the classical Regula Falsi method, the point $(x_0, f(x_0))$ is used in (15)
in place of $(x_{\nu-1}, f(x_{\nu-1}))$ for all $\nu = 1, 2, \ldots$. The geometric interpretation
of this scheme is quite clear, i.e., all lines pass through the original estimate,
which is a poor strategy in general. Again either interpolation or
extrapolation may occur.

Finally, we may use in (15), in place of $(x_{\nu-1}, f(x_{\nu-1}))$, the latest point
for which the function value has sign opposite that of $f(x_\nu)$. In this latter
method, only interpolation occurs, and furthermore, upper and lower
bounds for the root are obtained, which is ideal for estimating the error.
However, to start this scheme we must initially obtain such upper and
lower estimates of the root and, of course, it is only applicable if $f(x)$
changes sign at the root in question. This latter variation requires some
additional testing and storage of data and hence is slightly more complicated
to employ on a digital computer.
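
The last variant, in which the two points always have function values of opposite sign, might be organized as follows. This is a sketch under assumptions made here (the function name, the stopping rule based on successive approximations, and the test function $f(x) = x^2 - 2$ on $[1, 2]$), not a definitive implementation.

```python
def bracketing_false_position(f, lo, hi, tol=1e-12, max_iter=200):
    """False position keeping a sign change of f between lo and hi.

    Returns (x, lo, hi): the last approximation and the final bracket.
    """
    f_lo, f_hi = f(lo), f(hi)
    if f_lo * f_hi > 0.0:
        raise ValueError("need a sign change of f on [lo, hi]")
    x_prev = lo
    for _ in range(max_iter):
        # Zero of the line through (lo, f_lo) and (hi, f_hi): interpolation only.
        x = hi - f_hi * (hi - lo) / (f_hi - f_lo)
        f_x = f(x)
        if f_x == 0.0:
            return x, x, x
        # Replace the end point whose function value has the same sign as f(x).
        if f_x * f_lo < 0.0:
            hi, f_hi = x, f_x
        else:
            lo, f_lo = x, f_x
        if abs(x - x_prev) <= tol:
            return x, lo, hi
        x_prev = x
    raise RuntimeError("false position did not converge")

root, lo, hi = bracketing_false_position(lambda x: x * x - 2.0, 1.0, 2.0)
print(root, lo, hi)
```

For this convex example one end point never moves, so the bracket itself does not shrink to zero length; that is why the stopping test here compares successive approximations instead of the bracket width.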

From the geometric description of the method of false position a natural
generalization is suggested. That is, we set

$$x_{\nu+1} = P_{k,\nu}(0), \qquad \nu = k, k + 1, \ldots,$$

where $P_{k,\nu}(f)$ is the polynomial in $f$ of degree $k$ which passes through the
$k + 1$ points $(f(x_\nu), x_\nu), (f(x_{\nu-1}), x_{\nu-1}), \ldots, (f(x_{\nu-k}), x_{\nu-k})$. Clearly, for
$k = 1$, this is just the scheme (15). The construction of such interpolation
polynomials is, in general, treated in Chapter 6, and Section 2 of that
chapter is particularly suited for the present purpose. [We must interchange
$x$ and $f$ or else use inverse interpolation. Also, it is assumed that
the function values $f(x_j)$ are distinct.] It can be shown that these “multipoint”
methods have orders $\eta_k$ which increase monotonically with $k$
and that $\lim_{k \to \infty} \eta_k = 2$. We have seen that $\eta_1 \doteq 1.618$ so that no great
improvement over the method of false position can be obtained. For
$k = 2$ or 3, the orders are close to 2.

Another possibility along the above lines is to use

$$x_{\nu+1} = P_{\nu,\nu}(0), \qquad \nu = 1, 2, \ldots;$$

that is a $\nu$th degree polynomial in $f$ through all the previous iterates
$(f(x_j), x_j)$, $j = 0, 1, \ldots, \nu$ is used to determine the $(\nu + 1)$st iterate.
Again, the iterative linear interpolation scheme of Chapter 6, Section 2,
can be used for this purpose and it can be shown that the order of convergence
is now 2 (for simple roots).

2.4. Aitken's $\delta^2$-Method (Arbitrary Order)

This procedure is frequently presented as a means for accelerating the
convergence of the functional iteration method based on (3). The method
can be described and motivated as follows: If $x_\nu$ is any number approximating
a root of (1) or (3), let $\bar{x}_{\nu+1}$ be defined by

(24) $$\bar{x}_{\nu+1} = g(x_\nu).$$

Then a measure of the “errors” in these two approximations, $x_\nu$ and
$\bar{x}_{\nu+1}$, can be defined by

(25) $$\begin{aligned}
e_\nu &\equiv g(x_\nu) - x_\nu = \bar{x}_{\nu+1} - x_\nu, \\
\bar{e}_{\nu+1} &\equiv g(\bar{x}_{\nu+1}) - \bar{x}_{\nu+1}.
\end{aligned}$$

Since for a root this error should vanish, i.e., $e(\alpha) = g(\alpha) - \alpha =
-\phi(\alpha) f(\alpha) = 0$, we may seek $x_{\nu+1}$ by “extrapolating the errors to zero.”
That is, the line segment joining the points $(x_\nu, e_\nu)$ and $(\bar{x}_{\nu+1}, \bar{e}_{\nu+1})$ is
extended to intersect the $x$-axis and the point of intersection is taken as
$x_{\nu+1}$. This yields the expression

(26a) $$x_{\nu+1} = \frac{x_\nu \bar{e}_{\nu+1} - \bar{x}_{\nu+1} e_\nu}{\bar{e}_{\nu+1} - e_\nu} \equiv G(x_\nu).$$

For actual calculations (26a) is usually written as

(26b) $$x_{\nu+1} = x_\nu - \frac{e_\nu^{\,2}}{\bar{e}_{\nu+1} - e_\nu}$$

and the evaluations proceed by using (24), (25), and (26b).

From (24)-(26) we see that the $\delta^2$-method can be viewed as functional
iteration applied to

(27a) $$x = G(x),$$

where

(27b) $$G(x) \equiv \frac{x\,g(g(x)) - g^2(x)}{g(g(x)) - 2g(x) + x}.$$

[That is, from $x_0$ we obtain the same sequence of iterates $x_\nu$ by the procedure
described in (24)-(26) as is obtained from $x_{\nu+1} = G(x_\nu)$.]
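
The steps (24), (25), and (26b) translate directly into a short routine. The sketch below is an illustration only; the function name, the stopping rule, the guard against a vanishing denominator, and the example $g(x) = \cos x$ are assumptions made here.

```python
import math

def delta_squared_iteration(g, x0, tol=1e-13, max_iter=50):
    """Functional iteration on G of (27b), evaluated through (24)-(26b).

    Returns the final iterate and the number of iterations used.
    """
    x = x0
    for k in range(1, max_iter + 1):
        x_bar = g(x)              # (24)
        e = x_bar - x             # first "error" in (25)
        e_bar = g(x_bar) - x_bar  # second "error" in (25)
        denom = e_bar - e
        if denom == 0.0:
            # (26b) is indeterminate; x_bar is the best value available.
            return x_bar, k
        x_new = x - e * e / denom  # (26b)
        if abs(x_new - x) <= tol:
            return x_new, k
        x = x_new
    raise RuntimeError("delta-squared iteration did not converge")

root, steps = delta_squared_iteration(math.cos, 1.0)
print(root, steps)
```

Each pass costs two evaluations of $g$. For this example the plain iteration $x_{\nu+1} = \cos x_\nu$ needs several dozen steps to reach comparable accuracy, since its convergence factor is about 0.67.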

The functional iteration scheme applied to (27) is sometimes known as
Steffensen's method. In fact, Aitken's $\delta^2$-method† was originally proposed
to convert any convergent sequence (no matter how generated), $\{x_n\}$,
into a more rapidly convergent sequence, $\{x_n'\}$, by using

(28) $$x_n' = x_n - \frac{(x_{n+1} - x_n)^2}{x_{n+2} - 2x_{n+1} + x_n}.$$

† The denominator in (28) suggests the second difference notation $\delta^2$. See equation
(3.16a) of Chapter 6.

Several general applications of the $\delta^2$-process are illustrated in Problems
6 and 7.
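
Formula (28) can also be applied to a stored sequence, however it was generated. The following sketch is illustrative only; the function name, the handling of a vanishing second difference, and the test sequence (iterates of $x_{n+1} = \cos x_n$) are assumptions made here.

```python
import math

def aitken_delta_squared(xs):
    """Apply (28) to every consecutive triple of the sequence xs."""
    out = []
    for n in range(len(xs) - 2):
        second_diff = xs[n + 2] - 2.0 * xs[n + 1] + xs[n]
        if second_diff == 0.0:
            out.append(xs[n + 2])  # (28) undefined; keep the newest term
        else:
            out.append(xs[n] - (xs[n + 1] - xs[n]) ** 2 / second_diff)
    return out

# Assumed test sequence: a linearly convergent functional iteration.
xs = [1.0]
for _ in range(11):
    xs.append(math.cos(xs[-1]))

alpha = 0.7390851332151607  # root of x = cos x, quoted to double precision
acc = aitken_delta_squared(xs)
print("last original error   :", abs(xs[-1] - alpha))
print("last accelerated error:", abs(acc[-1] - alpha))
```

The accelerated value is built from the same three terms that end the original sequence, so the improvement is obtained without any further evaluations of the function.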

The function (27b) is indeterminate at the root $x = \alpha$ since $g(\alpha) = \alpha$.
However, its value there is easily found by an application of L'Hospital's
rule, assuming $g(x)$ to be differentiable at the root and $g'(\alpha) \ne 1$;

$$\begin{aligned}
G(\alpha) &= \frac{g(g(\alpha)) + \alpha\,g'(g(\alpha))\,g'(\alpha) - 2g(\alpha)\,g'(\alpha)}{g'(g(\alpha))\,g'(\alpha) - 2g'(\alpha) + 1} \\
&= \frac{\alpha + \alpha[g'(\alpha)]^2 - 2\alpha\,g'(\alpha)}{[g'(\alpha)]^2 - 2g'(\alpha) + 1} = \alpha.
\end{aligned}$$

The case $g'(\alpha) = 1$ corresponds to a multiple root of (1) at $x = \alpha$. However,
in this case too, it can be shown from (33d) that $\alpha = G(\alpha)$. Thus,
it follows that (27a) has roots wherever (3) has them. To show further
that all roots of (27) are also roots of (3), assume that $x$ is any finite root
of (27). Then there are two cases, either $g(g(x)) - 2g(x) + x$ vanishes or
not. If not, then clearing fractions in (27) is legitimate and yields

$$[g(x) - x]^2 = 0.$$

Thus, $x$ is also a root of (3). If the denominator in (27b) vanishes, the
numerator must also vanish (since $x$ was assumed finite). Now observe
that since the denominator vanishes, we may use

$$x\,g(g(x)) = 2x\,g(x) - x^2$$

and substitute in the numerator to find that again $[g(x) - x]^2 = 0$. In
other words (27) has the same roots as (3).

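The order of the $\delta^2$-method is simply related to the order of the functional
iteration applied to $x = g(x)$. To derive this result, we assume that $x = \alpha$
is a root and that:

(29a) $$g'(\alpha) = g''(\alpha) = \cdots = g^{(p-1)}(\alpha) = 0;$$

(29b) $$g^{(p)}(\alpha) = p!\,A \ne 0;$$

(29c) $$g^{(p+1)}(x) \text{ exists in } |x - \alpha| \le \rho.$$

These conditions imply that $g(x)$ determines a $p$th order method. By
Taylor's theorem and (29) for every $\epsilon$ such that $|\epsilon| \le \rho$:

(30a) $$\begin{aligned}
g(\alpha + \epsilon) &= g(\alpha) + A\epsilon^p + \frac{g^{(p+1)}(\alpha + \theta\epsilon)}{(p + 1)!}\,\epsilon^{p+1}, \qquad 0 < \theta < 1; \\
&= \alpha + A\epsilon^p + B\epsilon^{p+1} \\
&= \alpha + \delta.
\end{aligned}$$

Here

$$B \equiv \frac{g^{(p+1)}(\alpha + \theta\epsilon)}{(p + 1)!}, \qquad \delta \equiv (A + B\epsilon)\,\epsilon^p.$$

Since $A$ and $B$ are bounded, we can pick $\epsilon$ sufficiently small such that
$|\delta| \le \rho$ and then as in (30a)

(30b) $$g(\alpha + \delta) = \alpha + A\delta^p + B'\delta^{p+1}, \qquad B' \equiv \frac{g^{(p+1)}(\alpha + \phi\delta)}{(p + 1)!}, \qquad 0 < \phi < 1.$$

From (30) in (27b) we obtain, with $x = \alpha + \epsilon$ and $\epsilon \ne 0$,

(32) $$\begin{aligned}
G(\alpha + \epsilon) &= \frac{(\alpha + \epsilon)\,g(\alpha + \delta) - (\alpha + \delta)^2}{g(\alpha + \delta) - 2(\alpha + \delta) + (\alpha + \epsilon)} \\
&= \alpha - \frac{\delta^2 - A\epsilon\delta^p - B'\epsilon\delta^{p+1}}{\epsilon - 2\delta + A\delta^p + B'\delta^{p+1}}.
\end{aligned}$$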


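There are two cases, $p \ge 2$ and $p = 1$, to be considered. First, with
$p \ge 2$ equation (32) can be written as

$$G(\alpha + \epsilon) = \alpha - \epsilon^{2p-1}(A + B\epsilon)^2 \left\{ \frac{1 - A(A + B\epsilon)^{p-2}\epsilon^{(p-1)^2} - B'(A + B\epsilon)^{p-1}\epsilon^{p^2-p+1}}{1 - 2(A + B\epsilon)\epsilon^{p-1} + A(A + B\epsilon)^p \epsilon^{p^2-1} + B'(A + B\epsilon)^{p+1}\epsilon^{p^2+p-1}} \right\}.$$

It is clear that the bracketed expression approaches 1 as $\epsilon$ approaches 0,
and so the above may be written as

(33a) $$G(\alpha + \epsilon) = \alpha - A^2\epsilon^{2p-1} + O(\epsilon^{2p}), \qquad p \ge 2.$$

For the case $p = 1$, (32) becomes

(33b) $$G(\alpha + \epsilon) = \alpha - \epsilon^2(A + B\epsilon) \left\{ \frac{(B - B'A) - B'B\epsilon}{(1 - A)^2 - (2B - BA - B'A^2)\epsilon + 2B'BA\epsilon^2 + B'B^2\epsilon^3} \right\}.$$

Now in general, if $A \ne 1$, the bracketed expression approaches $B^*/(1 - A)$
as $\epsilon$ approaches 0 since $B'$ and $B$ approach $B^* \equiv g''(\alpha)/2$ and so (33b) can
be written as

(33c) $$G(\alpha + \epsilon) = \alpha - \frac{A B^*}{1 - A}\,\epsilon^2 + O(\epsilon^3); \qquad p = 1, \quad g'(\alpha) = A \ne 1.$$

But, if $A = g'(\alpha) = 1$ and $\alpha$ has multiplicity† $m$, then by Problem 4:

(33d) $$G(\alpha + \epsilon) = \alpha + \left( 1 - \frac{1}{m} \right)\epsilon + O(\epsilon^2),$$

for

$$p = 1, \quad g'(\alpha) = 1, \quad g''(\alpha) = \cdots = g^{(m-1)}(\alpha) = 0, \quad g^{(m)}(\alpha) \ne 0;$$

for $m = 2, 3, \ldots$.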

We now invoke a lemma which shall enable us to determine the orders
and convergence properties in the cases represented in (33a-d).

LEMMA 1. Let $G(x)$ be a function, with $q + 1$ derivatives in a neighborhood
of $x = \alpha$, such that

$$G(\alpha) = \alpha,$$

and for any $\epsilon$ sufficiently small

(34) $$G(\alpha + \epsilon) = \alpha + C\epsilon^q + O(\epsilon^{q+1}).$$

Then

$$G'(\alpha) = G''(\alpha) = \cdots = G^{(q-1)}(\alpha) = 0, \qquad G^{(q)}(\alpha) = q!\,C.$$

Proof. By Taylor's theorem we have, for sufficiently small $\epsilon$,

(35) $$G(\alpha + \epsilon) = \alpha + \frac{\epsilon}{1!}\,G'(\alpha) + \cdots + \frac{\epsilon^q}{q!}\,G^{(q)}(\alpha) + \frac{\epsilon^{q+1}}{(q + 1)!}\,G^{(q+1)}(\alpha + \theta\epsilon), \qquad 0 < \theta < 1.$$

The lemma follows by comparing, in the order $k = 1, 2, \ldots, q$, the
values obtained from (34) and (35) of:

$$\lim_{\epsilon \to 0} \left[ \frac{G(\alpha + \epsilon) - \alpha}{\epsilon^k} \right] k!. \qquad ■$$

By applying this lemma in (33a-d) we deduce the following

THEOREM 2. (i) If functional iteration applied to (3) is of order $p \ge 2$
for some root $\alpha$ of (1), then the $\delta^2$-method (24)-(26) is of order $2p - 1$ for
this root.

(ii) If functional iteration in (3) is of first order (but not
necessarily convergent) for a simple root $\alpha$ of (1), then the $\delta^2$-method is of
second order for this root.

(iii) If as in (ii), the root $\alpha$ of (1) has multiplicity $m \ge 2$,
then the $\delta^2$-method is first order with asymptotic convergence factor $1 - 1/m$.

Proof. Part (i) follows from (29), Lemma 1, and (33a). Part (ii) follows
from Lemma 1 and (33c) since $g'(\alpha) \ne 1$ is equivalent to $f'(\alpha) \ne 0$ (i.e.,
that the root is simple). Finally, (iii) follows from Lemma 1 and (33d)
since an $m$-fold† root of $f(x) = 0$ at $x = \alpha$ implies $g'(\alpha) = 1$, $g''(\alpha) = \cdots =
g^{(m-1)}(\alpha) = 0$ and $g^{(m)}(\alpha) \ne 0$. (This proof has assumed that $f(x)$, $g(x)$ and
$G(x)$ have as many derivatives as required.) ■

† If the functions in (1), (2), and (3) have $m$ derivatives, we may verify the equivalence
of the statements:

(i) $\alpha$ is a root of $f(x)$ of multiplicity $m \ge 2$;

(ii) $f(\alpha) = f'(\alpha) = \cdots = f^{(m-1)}(\alpha) = 0$, $f^{(m)}(\alpha) \ne 0$;

(iii) $g(\alpha) = \alpha$, $g'(\alpha) = 1$, $g''(\alpha) = \cdots = g^{(m-1)}(\alpha) = 0$, $g^{(m)}(\alpha) \ne 0$.

Statements (i) and (ii) are equivalent by definition. The equivalence of (ii) and (iii)
follows from Leibnitz' rule,

$$(\phi f)^{(k)} = \phi^{(k)} f + k\phi^{(k-1)} f' + \cdots + k\phi' f^{(k-1)} + \phi f^{(k)},$$

and the fact that $\phi \ne 0$, by induction on $k = 1, 2, \ldots, m$.

From this theorem, it follows that in all cases Aitken's $\delta^2$-method converges
if $|\alpha - x_0|$ is sufficiently small. Furthermore, it is always at least
of second order for simple roots. It is clear that this method can be quite
effective and it, or generalizations of it described below, may be very
profitably used in practice.

Iterations which converge even faster than the $\delta^2$-method are naturally
suggested by the above “derivation” of (26). One such generalization is to
consider the set of more than two errors associated with $x_\nu, \bar{x}_{\nu+1}, \ldots,
\bar{x}_{\nu+p}$, as defined in (24) and (25), say

(36) $$e_\nu, \bar{e}_{\nu+1}, \ldots, \bar{e}_{\nu+p}, \qquad p \ge 1,$$

and then determining $x_{\nu+1}$ such that this set of errors is “extrapolated”
to zero. The details of such a procedure require a knowledge of polynomial
interpolation which is discussed in Chapter 6 (see Section 2 in
particular). The main point in the correct application of this procedure
is to consider the $x_\nu$ as functions of the $e_\nu$ (i.e., inverse interpolation) in
which case the approximation $x_{\nu+1}$ can be computed directly by evaluating
at $e = 0$ the polynomial of $p$th degree in $e$ that takes on the values $x_\nu$
at $e_\nu$, and $\bar{x}_{\nu+k}$ at $\bar{e}_{\nu+k}$ for $1 \le k \le p$. Other generalizations of these
procedures can be obtained by successively increasing the value of $p$.
These considerations are, in fact, the same as those in Subsection 2.3
where generalizations of false position were discussed. Another obvious
type of modification is described by introducing $G^{(0)}(x) \equiv g(x)$,
$G^{(1)}(x) \equiv G(x)$ and then forming successive functions $G^{(k)}(x)$, $k = 2, 3, \ldots$, by recursive application of (27).

It should be noted that the correction in (26b), $-e_\nu^{\,2}/(\bar{e}_{\nu+1} - e_\nu)$, is
the quotient of very small quantities. The denominator, being a difference
of small quantities, may require multiple precision evaluation of $g(g(x_\nu))$
and $g(x_\nu)$, in order not to lose too many significant figures, especially if
$g'(\alpha) \doteq 1$ (i.e., if the root is multiple or nearly so). For these reasons it is
important to determine an appropriate $\delta$ which can be used in Theorem
1.3 in estimating the effect of errors in the $\delta^2$-method.

† See previous footnote.


PROBLEMS, SECTION 2 


1. Solve the recursion $m_{\nu+1} = m_\nu + m_{\nu-1}$, $\nu = 1, 2, \ldots$, where $m_0 = m_1
= 1$. (Try a solution of the form $m_\nu = r^\nu$ which leads to a quadratic with roots
$r_\pm$. Then set $m_\nu = a r_+^{\,\nu} + b r_-^{\,\nu}$ and determine $a$ and $b$ from $\nu = 0, 1$.)

2. The second divided difference of a function $f(x)$ is defined by:

$$f[x_1, x_2, x_3] \equiv \frac{\dfrac{f(x_1) - f(x_2)}{x_1 - x_2} - \dfrac{f(x_2) - f(x_3)}{x_2 - x_3}}{x_1 - x_3}.$$

The first divided difference is just the difference quotient. Use these definitions
to verify the derivation of (16).

3. Verify (17).

[Hint: If $f''(x)$ is continuous in an interval containing $x_1$, $x_2$, and $x_3$, then
the second result in (17) can be derived by the expansion, via Taylor's formula
with remainder, of $f(x_1)$ and $f(x_3)$ about $x_2$ plus the fact that a continuous
function takes on all values between any two of its values. (Assume, with no
loss in generality, that $x_1 < x_2 < x_3$ in the definition above. That is, it is
easy to verify $f[x_1, x_2, x_3] = f[x_i, x_j, x_k]$ where $(i, j, k)$ is any permutation of
(1, 2, 3). This is a special case of a property established in Chapter 6, Section 1,
namely that the divided difference is a symmetric function of its arguments.)]


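4. Let $g(\alpha) = \alpha$, $g'(\alpha) = 1$, $g''(\alpha) = g'''(\alpha) = \cdots = g^{(m-1)}(\alpha) = 0$, and
$g^{(m)}(\alpha) \ne 0$. Then if we assume $g$ has derivatives of order $2m$, $g(\alpha + \epsilon) =
\alpha + \epsilon + B\epsilon^2$ where $m \ge 2$ and

$$B(\epsilon) = \epsilon^{m-2} \left[ \frac{g^{(m)}(\alpha)}{m!} + \epsilon\,\frac{g^{(m+1)}(\alpha)}{(m + 1)!} + \cdots \right],$$

and similarly, with $\delta = \epsilon + B\epsilon^2$ then $g(\alpha + \delta) = \alpha + \delta + B'\delta^2$, where

$$B'(\delta) = \delta^{m-2} \left[ \frac{g^{(m)}(\alpha)}{m!} + \delta\,\frac{g^{(m+1)}(\alpha)}{(m + 1)!} + \cdots \right].$$

Now observe that

$$B'B = \left[ \frac{g^{(m)}(\alpha)}{m!} \right]^2 \epsilon^{2m-4} + \cdots$$

and that

$$B' - B = (m - 2) \left[ \frac{g^{(m)}(\alpha)}{m!} \right]^2 \epsilon^{2m-3} + \cdots.$$

Hence, show that formula (33b) yields the results of (33d) for $m = 2, 3, \ldots$.

5. Verify that the functional iteration scheme is divergent for (a) $g(x) \equiv
x + x^3$ and (b) $g(x) \equiv 2x + x^3$. Nevertheless, as stated in part (ii) of Theorem
2, the Aitken $\delta^2$-method is convergent and

(a) $$G(x) = x - \frac{x}{3 + 3x^2 + x^4} = \tfrac23 x + O(x^3)$$

(b) $$G(x) = 6x^3 + O(x^5).$$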

6. (Aitken's $\delta^2$-process). Let $\{x_n\}$, $n = 0, 1, 2, \ldots$, converge to $\alpha$; so that,
for some constant $b$,

$$r_n \equiv x_n - \alpha \ne 0, \qquad n \ge N;$$

$$r_{n+1} = (b + \epsilon_n)\,r_n, \qquad |b| < 1, \quad \epsilon_n = o(1).\dagger$$

Show that (28) is meaningful for $n \ge N$, i.e.,

$$x_{n+2} - 2x_{n+1} + x_n \ne 0 \quad \text{for } n \ge N;$$

and that

$$\lim_{n \to \infty} \frac{x_n' - \alpha}{x_n - \alpha} = 0.$$

[Hint: Verify that

$$\begin{aligned}
x_{n+2} - 2x_{n+1} + x_n &= r_{n+2} - 2r_{n+1} + r_n \\
&= r_n\,[(b - 1)^2 + o(1)].
\end{aligned}$$

Also show from (28) that

$$\begin{aligned}
x_n' - \alpha &= r_n - r_n\,\frac{[b - 1 + o(1)]^2}{(b - 1)^2 + o(1)} \\
&= r_n\,o(1).]
\end{aligned}$$

7. Apply Aitken's $\delta^2$-process (28) to the sequence

$$x_n = \alpha + b\,p_1^{\,n} + c\,p_2^{\,n}, \qquad n = 0, 1, 2, \ldots,$$

where $|p_2| < |p_1| < 1$. Show that

$$x_n' = \alpha + O(p_2^{\,n}) + O(p_1^{\,2n}).$$

What improvement results by applying the $\delta^2$-process to the sequence $\{x_n'\}$?


3. FUNCTIONAL ITERATION FOR A SYSTEM OF EQUATIONS 

Let $\mathbf{x}$ be an $n$-dimensional column vector with components $x_1, x_2, \ldots, x_n$
and $\mathbf{g}(\mathbf{x})$ an $n$-dimensional vector valued function, i.e., a column vector
with components $g_1(\mathbf{x}), g_2(\mathbf{x}), \ldots, g_n(\mathbf{x})$. Then the system to be solved is

(1) $$\mathbf{x} = \mathbf{g}(\mathbf{x}).$$

† We write $\delta_n = o(1)$ iff there is some number $N$ such that $\delta_n$ is defined for all $n > N$
and $\lim_{n \to \infty} \delta_n = 0$. In the text, we also use $o(1)$ as a generic symbol to represent the
members of any sequence which tends to zero.

The solution (or root) is some vector, say $\boldsymbol{\alpha}$, with components $\alpha_1, \alpha_2, \ldots,
\alpha_n$ which is, of course, some point in the $n$-dimensional space. Starting with
a point $\mathbf{x}^{(0)} = [x_1^{(0)}, x_2^{(0)}, \ldots, x_n^{(0)}]^T$, the exact analog of the functional
iteration of Section 1 is

(2) $$\mathbf{x}^{(\nu+1)} = \mathbf{g}(\mathbf{x}^{(\nu)}), \qquad \nu = 0, 1, 2, \ldots.$$

The first result is analogous to Theorem 1.1. But where absolute values
were used previously, we must now use some vector norm (see Chapter 1,
Section 1). For example, we may choose any one of the norms

(3) $$\|\mathbf{x}\|_\infty \equiv \max_{1 \le j \le n} |x_j|, \qquad \|\mathbf{x}\|_1 \equiv \sum_{j=1}^{n} |x_j|, \qquad \|\mathbf{x}\|_2 \equiv \sqrt{\sum_{j=1}^{n} |x_j|^2}.$$

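THEOREM 1. Let $\mathbf{g}(\mathbf{x})$ satisfy

(4a) $$\|\mathbf{g}(\mathbf{x}) - \mathbf{g}(\mathbf{y})\| \le \lambda \|\mathbf{x} - \mathbf{y}\|$$

for all vectors $\mathbf{x}$, $\mathbf{y}$ such that $\|\mathbf{x} - \mathbf{x}^{(0)}\| \le \rho$, $\|\mathbf{y} - \mathbf{x}^{(0)}\| \le \rho$ with the
Lipschitz constant, $\lambda$, satisfying

(4b) $$0 \le \lambda < 1.$$

Let the initial iterate, $\mathbf{x}^{(0)}$, satisfy

(5) $$\|\mathbf{g}(\mathbf{x}^{(0)}) - \mathbf{x}^{(0)}\| \le (1 - \lambda)\rho.$$

Then: (i) all iterates, (2), satisfy

$$\|\mathbf{x}^{(\nu)} - \mathbf{x}^{(0)}\| \le \rho;$$

(ii) the iterates converge to some vector, say

$$\lim_{\nu \to \infty} \mathbf{x}^{(\nu)} = \boldsymbol{\alpha},$$

which is a root of (1);

(iii) $\boldsymbol{\alpha}$ is the only root of (1) in $\|\mathbf{x} - \mathbf{x}^{(0)}\| \le \rho$.

Proof. Duplicate the proof of Theorem 1.1 with the replacement of
absolute value signs by norm symbols. ■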

As a consequence of this proof, it is also seen that the iterates converge
geometrically, and at least as fast as $\lambda^\nu \to 0$. Of course, it is more difficult
to verify (4), the Lipschitz continuity of a vector valued function, than
it is in the case of a scalar function.

However, again as in Section 1, a more useful result can be obtained
if we are willing to place more restrictions on $\mathbf{g}(\mathbf{x})$ and assume the existence
of a root. We immediately see that Theorem 1.2 and its proof hold if
absolute value signs are replaced by norms. Furthermore, the corollary
to Theorem 1.2 becomes

THEOREM 2. Let (1) have a root $\mathbf{x} = \boldsymbol{\alpha}$. Let the components $g_i(\mathbf{x})$ have
continuous first partial derivatives and satisfy

(6) $$\left| \frac{\partial g_i(\mathbf{x})}{\partial x_j} \right| \le \frac{\lambda}{n}, \qquad \lambda < 1;$$

for all $\mathbf{x}$ in

(7) $$\|\mathbf{x} - \boldsymbol{\alpha}\|_\infty \le \rho.$$

Then: (i) For any $\mathbf{x}^{(0)}$ satisfying (7) all the iterates $\mathbf{x}^{(\nu)}$ of (2) also satisfy (7).

(ii) For any $\mathbf{x}^{(0)}$ satisfying (7) the iterates (2) converge to the root $\boldsymbol{\alpha}$
of (1) which is unique in (7).

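Proof. For any two points $x$, $y$ in (7) we have by Taylor’s theorem:

(8) $g_i(x) - g_i(y) = \sum_{j=1}^{n} \frac{\partial g_i(\xi^{(i)})}{\partial x_j} (x_j - y_j), \quad i = 1, 2, \ldots, n$,

where $\xi^{(i)}$ is a point on the open line segment joining $x$ and $y$. Thus,
$\xi^{(i)}$ is in (7), and using (3) and (6) yields

$$\begin{aligned}
|g_i(x) - g_i(y)| &\le \sum_{j=1}^{n} \left| \frac{\partial g_i(\xi^{(i)})}{\partial x_j} \right| \cdot |x_j - y_j| \\
&\le \|x - y\|_\infty \sum_{j=1}^{n} \left| \frac{\partial g_i(\xi^{(i)})}{\partial x_j} \right| \\
&\le \lambda \|x - y\|_\infty.
\end{aligned}$$

Since the inequality holds for each $i$, we have

(9) $\|g(x) - g(y)\|_\infty \le \lambda \|x - y\|_\infty$

and thus we have proven that $g(x)$ is Lipschitz continuous in the domain
(7), with respect to the indicated norm. Now note that for any $x^{(0)}$ in (7),

$$\begin{aligned}
\|x^{(1)} - \alpha\|_\infty &= \|g(x^{(0)}) - g(\alpha)\|_\infty \\
&\le \lambda \|x^{(0)} - \alpha\|_\infty \\
&\le \lambda \rho,
\end{aligned}$$

and so $x^{(1)}$ is also in (7). By an obvious induction we have then

(10)
$$\begin{aligned}
\|x^{(\nu)} - \alpha\|_\infty &= \|g(x^{(\nu-1)}) - g(\alpha)\|_\infty \\
&\le \lambda \|x^{(\nu-1)} - \alpha\|_\infty \\
&\le \lambda^\nu \|x^{(0)} - \alpha\|_\infty \\
&\le \lambda^\nu \rho,
\end{aligned}$$

and hence all $x^{(\nu)}$ lie in (7). The convergence immediately follows from
(10) since $\lambda < 1$. The uniqueness follows as before. ■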


The crucial point in the preceding proof is the derivation of (9). It is 
clear from this derivation that (6) could be replaced by a number of 
conditions which are perhaps less restrictive and the theorem would still 
remain valid. One such condition is 


(11) $\max_i \sum_{j=1}^{n} |g_{ij}(x)| \le \lambda < 1$, for all $\|x - \alpha\|_\infty \le \rho$,

where we have introduced the elements $g_{ij}(x) \equiv \partial g_i(x)/\partial x_j$. If we define
the matrix $G(x) \equiv (g_{ij}(x))$ then (11) may be written as $\|G(x)\|_\infty \le \lambda < 1$
in which case we mean the natural matrix norm induced by the maximum
vector norm (see Chapter 1, Section 1).

If the function $g(x)$ is such that at a root

(12) $\frac{\partial g_i(\alpha)}{\partial x_j} = 0, \quad i, j = 1, 2, \ldots, n$,

and these derivatives are continuous near the root, then (6) as well as (11)
can be satisfied for some $\rho > 0$. If, in addition, the second derivatives

$\frac{\partial^2 g_i(x)}{\partial x_j \, \partial x_k}$

all exist in a neighborhood of the root, then again by Taylor’s theorem

$g_i(x) - g_i(\alpha) = \frac{1}{2} \sum_{j,k=1}^{n} \frac{\partial^2 g_i(\xi^{(i)})}{\partial x_j \, \partial x_k} (x_j - \alpha_j)(x_k - \alpha_k)$.

Now in the iteration, (2), we find

$\|x^{(\nu)} - \alpha\|_\infty \le M \|x^{(\nu-1)} - \alpha\|_\infty^2$,

where $M$ is chosen such that

$\max_{i,j,k} \left| \frac{\partial^2 g_i(x)}{\partial x_j \, \partial x_k} \right| \le \frac{2M}{n^2}$.

Thus, quadratic convergence can occur in solving systems of equations by
iteration.

3.1. Some Explicit Iteration Schemes for Systems 

In the general case, the system to be solved is of the form

(13) $f(x) = o$

where $f(x) \equiv [f_1(x), f_2(x), \ldots, f_n(x)]^T$ is an $n$-component column vector.
Such a system can be written in the form (1) in a variety of ways; we
examine here the choice

(14) $g(x) = x - A(x) f(x)$,

where $A(x)$ is an $n$th order square matrix with components $a_{ij}(x)$. The
equations (1) and (13) will have the same set of solutions if $A(x)$ is
non-singular [since in that case $A(x) f(x) = o$ implies $f(x) = o$].

The simplest choice for $A(x)$ is

(15) $A(x) \equiv A$,

a constant non-singular matrix. If we introduce the matrix

(16) $J(x) \equiv \left( \frac{\partial f_i(x)}{\partial x_j} \right)$,

whose determinant is the Jacobian of the functions $f_i(x)$, then from (14)-(16)
we have

(17) $G(x) \equiv \left( \frac{\partial g_i(x)}{\partial x_j} \right) = I - A J(x)$.

Thus by Theorem 2, or its modification in which (11) replaces (6), the
iterations determined by using

$x^{(\nu+1)} = x^{(\nu)} - A f(x^{(\nu)})$

will converge, for $x^{(0)}$ sufficiently close to $\alpha$, if the elements in the matrix
(17) are sufficiently small, for example, as in the case that $J(\alpha)$ is
non-singular and $A$ is approximately the inverse of $J(\alpha)$. This procedure is the
analog of the chord method and it naturally suggests a modification which
is again called Newton’s method.
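A short numerical experiment illustrates the constant-matrix iteration. This is only a sketch under stated assumptions: the two-equation test system, the starting point and the choice of `A` as the inverse Jacobian frozen at the starting point are all illustrative and do not come from the text.

```python
import math

def f(x):
    # Hypothetical test system: circle x^2 + y^2 = 4 and hyperbola x*y = 1.
    return [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] * x[1] - 1.0]

def jacobian(x):
    """Matrix of first partial derivatives of f."""
    return [[2.0 * x[0], 2.0 * x[1]], [x[1], x[0]]]

def inverse_2x2(m):
    """Explicit inverse of a non-singular 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def chord_method(f, A, x0, tol=1e-12, max_iter=100):
    """Iterate x <- x - A f(x) with a constant matrix A."""
    x = list(x0)
    n = len(x)
    for k in range(1, max_iter + 1):
        fx = f(x)
        step = [sum(A[i][j] * fx[j] for j in range(n)) for i in range(n)]
        x = [xi - si for xi, si in zip(x, step)]
        if max(abs(s) for s in step) <= tol:
            return x, k
    raise RuntimeError("iteration did not converge")

x0 = [2.0, 0.5]
A = inverse_2x2(jacobian(x0))   # frozen approximation to the inverse Jacobian
root, iterations = chord_method(f, A, x0)
# Closed-form solution of the test system, used only for checking.
exact = [math.sqrt(2.0 + math.sqrt(3.0)), math.sqrt(2.0 - math.sqrt(3.0))]
print(root, iterations)
```

Because `A` is never updated, each step costs one evaluation of `f` and one matrix-vector product, at the price of linear rather than quadratic convergence.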

In Newton’s method (15) is replaced by the choice

(18) $A(x) \equiv J^{-1}(x)$,

with the assumption of course that $\det |J(x)| \ne 0$ for $x$ in $\|x - \alpha\| \le \rho$.
In actually using the above procedure, an inverse need not be computed
at each iteration; instead, a linear system of order $n$ has to be solved. To
see this, and at the same time gain some insight into the method, we
note that by using (18) in (14) the iterations for Newton’s method are:

(19a) $x^{(\nu+1)} = g(x^{(\nu)}) = x^{(\nu)} - J^{-1}(x^{(\nu)}) f(x^{(\nu)})$.

From this we obtain

(19b) $J(x^{(\nu)}) (x^{(\nu)} - x^{(\nu+1)}) = f(x^{(\nu)})$,

which is the system to be solved for the vector $(x^{(\nu)} - x^{(\nu+1)})$.
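The remark that only a linear system has to be solved at each step can be made concrete as follows. This is a minimal sketch: the elimination routine, the test system and the stopping rule are assumptions made for illustration and are not taken from the text.

```python
def solve_linear(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ZeroDivisionError("singular Jacobian")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = s / m[i][i]
    return z

def newton(f, jac, x0, tol=1e-12, max_iter=50):
    """Newton iteration: solve J(x) d = f(x) for d = x_old - x_new, as in (19b)."""
    x = list(x0)
    for k in range(1, max_iter + 1):
        d = solve_linear(jac(x), f(x))
        x = [xi - di for xi, di in zip(x, d)]
        if max(abs(di) for di in d) <= tol:
            return x, k
    raise RuntimeError("iteration did not converge")

def f(x):
    # Hypothetical test system (not from the text).
    return [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] * x[1] - 1.0]

def jac(x):
    return [[2.0 * x[0], 2.0 * x[1]], [x[1], x[0]]]

root, iterations = newton(f, jac, [2.0, 0.5])
residual = max(abs(v) for v in f(root))
print(root, iterations, residual)
```

No inverse matrix is ever formed; the cost per step is one Jacobian evaluation and one elimination of order $n$.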

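To show that this method is of second order we must verify that (12)
is satisfied when (18) is used in (14). The $j$th column of $G(x)$ is then given
by

$\frac{\partial g(x)}{\partial x_j} = \frac{\partial x}{\partial x_j} - J^{-1}(x) \frac{\partial f(x)}{\partial x_j} - \frac{\partial J^{-1}(x)}{\partial x_j} f(x)$.

By setting $x = \alpha$ in the above and recalling that $f(\alpha) = o$ and $J = (\partial f/\partial x_j)$
we get

$G(\alpha) = I - J^{-1}(\alpha) J(\alpha) - 0 = 0$.

To determine $\partial J^{-1}(x)/\partial x_j$, note that

$\frac{\partial (J^{-1} J)}{\partial x_j} = \frac{\partial J^{-1}}{\partial x_j} J + J^{-1} \frac{\partial J}{\partial x_j} = \frac{\partial I}{\partial x_j} = 0$

and hence

$\frac{\partial J^{-1}(x)}{\partial x_j} = -J^{-1}(x) \frac{\partial J(x)}{\partial x_j} J^{-1}(x)$.

Thus, we need only require that $f(x)$ have two derivatives and $J(x)$ be
non-singular at the root, and then the convergence of Newton’s method is
quadratic.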

For a geometric interpretation of Newton’s method we consider a
system of two equations and drop subscripts by using

$\mathbf{x} \equiv \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \equiv \begin{pmatrix} x \\ y \end{pmatrix}, \qquad \mathbf{f}(\mathbf{x}) \equiv \begin{pmatrix} f_1(x_1, x_2) \\ f_2(x_1, x_2) \end{pmatrix} \equiv \begin{pmatrix} f(x, y) \\ g(x, y) \end{pmatrix}$.

Then

$J(\mathbf{x}) = \begin{pmatrix} f_x & f_y \\ g_x & g_y \end{pmatrix}$

and the system (19) can be written as:

(20a) $(x_{\nu+1} - x_\nu) f_x(x_\nu, y_\nu) + (y_{\nu+1} - y_\nu) f_y(x_\nu, y_\nu) + f(x_\nu, y_\nu) = 0$,

(20b) $(x_{\nu+1} - x_\nu) g_x(x_\nu, y_\nu) + (y_{\nu+1} - y_\nu) g_y(x_\nu, y_\nu) + g(x_\nu, y_\nu) = 0$.

In the $(x, y, z)$-space the equations

(21a) $z = (x - x_\nu) f_x(x_\nu, y_\nu) + (y - y_\nu) f_y(x_\nu, y_\nu) + f(x_\nu, y_\nu)$,

(21b) $z = (x - x_\nu) g_x(x_\nu, y_\nu) + (y - y_\nu) g_y(x_\nu, y_\nu) + g(x_\nu, y_\nu)$,

each represent planes. The plane (21a) is tangent to the surface $z = f(x, y)$
at the point $(x_\nu, y_\nu, f(x_\nu, y_\nu))$, and the plane (21b) is tangent to $z = g(x, y)$
at the point $(x_\nu, y_\nu, g(x_\nu, y_\nu))$. Clearly, the point $(x_{\nu+1}, y_{\nu+1})$ determined
from (20) is the point of intersection of these two planes with the plane
$z = 0$, i.e., the $(x, y)$-plane. Thus, in passing from one dimension (Section
2.2) to two dimensions, Newton’s method is generalized by replacing
tangent lines with tangent planes. In the more general case of $n$ dimensions
the obvious interpretation, using tangent hyperplanes, is valid. Each of the
equations

$z = \sum_{k=1}^{n} (x_k - x_k^{(\nu)}) \frac{\partial f_i(x^{(\nu)})}{\partial x_k} + f_i(x^{(\nu)}), \quad i = 1, 2, \ldots, n$,

represents a hyperplane in the $(x_1, x_2, \ldots, x_n, z)$ space of $n + 1$ dimensions
which is tangent at the point $(x_1^{(\nu)}, x_2^{(\nu)}, \ldots, x_n^{(\nu)})$ to the corresponding
hypersurface

$z = f_i(x_1, x_2, \ldots, x_n)$.

The difficulties which may arise in the solution of systems using Newton’s 
method can be interpreted by means of these geometric considerations. 

3.2. Convergence of Newton’s Method

If the initial iterate $x^{(0)}$ is sufficiently close to the root $\alpha$ of $f(x) = o$,
then Theorem 2 can be used to prove that the Newton iterates, $x^{(\nu)}$,
defined in (19) converge to the root. In addition, if the Jacobian $J(x)$
is non-singular at the root, $x = \alpha$, and differentiable there, then the
convergence is second order. However, we do not know from these results
if a given initial iterate $x^{(0)}$ is close enough to the unknown root, $\alpha$. We
shall develop a sufficient condition, under which Newton’s scheme
converges, with the property that this condition may be explicitly checked
without a knowledge of $\alpha$. In fact, the theorem to be established also
proves the existence of a unique root of $f(x)$ in an appropriate interval
about the initial iterate, $x^{(0)}$. Thus, we have an alternative to Theorem 1
which we state as

THEOREM 3. Let the initial iterate $x^{(0)}$ be such that the Jacobian matrix
$J(x^{(0)})$ defined in (16) has an inverse with norm bounded by

(22a) $\|J^{-1}(x^{(0)})\| \le a$.

Let the difference of the first two Newton iterates be bounded by

(22b) $\|x^{(1)} - x^{(0)}\| = \|J^{-1}(x^{(0)}) f(x^{(0)})\| \le b$.

Let the components of $f(x)$ have continuous second derivatives which satisfy

(22c) $\sum_{k=1}^{n} \left| \frac{\partial^2 f_i(x)}{\partial x_j \, \partial x_k} \right| \le \frac{c}{n}$,

for all $x$ in $\|x - x^{(0)}\| \le 2b$; $i, j = 1, 2, \ldots, n$. If the constants $a$, $b$, and
$c$ are such that

(22d) $abc \le \frac{1}{2}$

then: (i) the Newton iterates (19) are uniquely defined and lie in the
“$2b$-sphere” about $x^{(0)}$:

$\|x^{(\nu)} - x^{(0)}\| \le 2b$;

(ii) the iterates converge to some vector, say $\lim_{\nu \to \infty} x^{(\nu)} = \alpha$, for which
$f(\alpha) = o$ and

(23) $\|x^{(\nu)} - \alpha\| \le \frac{b}{2^{\nu-1}}$.

[All vector norms in the statement and proof of this theorem are maximum
norms, i.e., $\|x\| = \max_j |x_j|$, and matrix norms are the corresponding
induced natural norm, i.e., $\|A\| = \max_i \sum_{j=1}^{n} |a_{ij}|$.]

Proof. The proof proceeds by a somewhat lengthy induction. For
convenience, we use the notation $J_\nu \equiv J(x^{(\nu)})$ for the Jacobian matrices
(16) and show for all $\nu = 0, 1, 2, \ldots$ that with $A_{\nu+1} \equiv I - J_\nu^{-1} J_{\nu+1}$,

(24a) $\|x^{(\nu+1)} - x^{(\nu)}\| \le \frac{b}{2^\nu}$,

(24b) $\|x^{(\nu+1)} - x^{(0)}\| \le 2b$,

(24c) $\|A_{\nu+1}\| = \|J_\nu^{-1}(J_\nu - J_{\nu+1})\| \le \frac{1}{2}$,

(24d) $\|J_{\nu+1}^{-1}\| = \|(I - A_{\nu+1})^{-1} J_\nu^{-1}\| \le 2^{\nu+1} a$.

From the hypothesis (22b) it trivially follows that (24a, b) are satisfied
for $\nu = 0$. Now when (24b) is established up to and including any value $\nu$
then $x^{(\nu+1)}$ and $x^{(\nu)}$ are in the $2b$-sphere about $x^{(0)}$ in which we are assured
that the second derivatives of the $f_i(x)$ are continuous. Then we can apply
Taylor’s theorem to the components of $J_{\nu+1}$ to obtain

$\frac{\partial f_i(x^{(\nu+1)})}{\partial x_j} = \frac{\partial f_i(x^{(\nu)})}{\partial x_j} + \sum_{k=1}^{n} (x_k^{(\nu+1)} - x_k^{(\nu)}) \frac{\partial^2 f_i}{\partial x_j \, \partial x_k} [x^{(\nu)} + \theta_{ij}(x^{(\nu+1)} - x^{(\nu)})], \quad 0 < \theta_{ij} < 1$.
Since $x^{(\nu+1)}$ and $x^{(\nu)}$ are in $\|x - x^{(0)}\| \le 2b$, so is the point
$x^{(\nu)} + \theta(x^{(\nu+1)} - x^{(\nu)})$, and (22c) applies. This gives from the above

(25) $\|J_{\nu+1} - J_\nu\| \le c \|x^{(\nu+1)} - x^{(\nu)}\|$.

At the present stage in the proof this is valid only for $\nu = 0$. But then using
this and (22a, b, d) in (24c) with $\nu = 0$ yields

$$\begin{aligned}
\|A_1\| &\le \|J_0^{-1}\| \cdot \|J_0 - J_1\| \\
&\le ac \|x^{(1)} - x^{(0)}\| \\
&\le abc \\
&\le \tfrac{1}{2}.
\end{aligned}$$

Now (24a, b, c) have been established for $\nu = 0$.

If for any $\nu$ the matrix $J_\nu$ is non-singular, then we have the identity

$J_{\nu+1} = J_\nu (I - A_{\nu+1})$,

where, as in (24c), $A_{\nu+1} = J_\nu^{-1}(J_\nu - J_{\nu+1})$. But from the Corollary to
Theorem 1.5 of Chapter 1 it follows that if $\|A_{\nu+1}\| < 1$ then $J_{\nu+1}$ is
non-singular and

(26) $\|J_{\nu+1}^{-1}\| \le \frac{\|J_\nu^{-1}\|}{1 - \|A_{\nu+1}\|}$.

Since (24c) is valid for $\nu = 0$ we can use this in (26) to get

$\|J_1^{-1}\| \le 2a$.

Thus (24) has been verified for $\nu = 0$.

Let us now make the inductive assumption that (24) is valid for all
$\nu \le k - 1$ and proceed to show that it is also valid for $\nu = k$. Since $J_k$
is non-singular, the $(k + 1)$st Newton iterate, $x^{(k+1)}$, is uniquely defined
and we have from (19a):

(27) $\|x^{(k+1)} - x^{(k)}\| = \|J_k^{-1} f(x^{(k)})\| \le \|J_k^{-1}\| \cdot \|f(x^{(k)})\|$.

However, since (24b) is valid for $\nu = k - 1$, the point $x^{(k)}$ is in the
$2b$-sphere about $x^{(0)}$. Then by Taylor’s theorem, with remainder term
$R$, and (19b) with $\nu = k - 1$:

$$\begin{aligned}
f(x^{(k)}) &= f(x^{(k-1)}) + J_{k-1}[x^{(k)} - x^{(k-1)}] + R(x^{(k)}, x^{(k-1)}) \\
&= R(x^{(k)}, x^{(k-1)}).
\end{aligned}$$

Using (22c), we can bound the above remainder term to yield

$$\begin{aligned}
\|f(x^{(k)})\| &= \max_i |R_i(x^{(k)}, x^{(k-1)})| \\
&= \max_i \left| \sum_{j=1}^{n} \sum_{l=1}^{n} \frac{1}{2!} (x_j^{(k)} - x_j^{(k-1)})(x_l^{(k)} - x_l^{(k-1)}) \frac{\partial^2 f_i}{\partial x_j \, \partial x_l} [x^{(k-1)} + \phi_i(x^{(k)} - x^{(k-1)})] \right|, \quad 0 < \phi_i < 1,
\end{aligned}$$

(28) $\|f(x^{(k)})\| \le \frac{c}{2} \|x^{(k)} - x^{(k-1)}\|^2$.

Again, we have used the fact that $x^{(k-1)} + \phi(x^{(k)} - x^{(k-1)})$ is in the
$2b$-sphere about $x^{(0)}$ since $x^{(k)}$ and $x^{(k-1)}$ are in it. Now using (28) in
(27) and recalling that (24) is assumed valid for all $\nu \le k - 1$ we get

(29)
$$\begin{aligned}
\|x^{(k+1)} - x^{(k)}\| &\le \frac{c}{2} \|J_k^{-1}\| \cdot \|x^{(k)} - x^{(k-1)}\|^2 \\
&\le \frac{c}{2} \, 2^k a \left( \frac{b}{2^{k-1}} \right)^2 = \frac{ab^2c}{2^{k-1}} \\
&\le \frac{b}{2^k}.
\end{aligned}$$

Thus (24a) is established for $\nu = k$. Then since

$$\begin{aligned}
\|x^{(k+1)} - x^{(0)}\| &= \left\| \sum_{l=0}^{k} (x^{(l+1)} - x^{(l)}) \right\| \\
&\le \sum_{l=0}^{k} \|x^{(l+1)} - x^{(l)}\| \\
&\le b \sum_{l=0}^{k} \frac{1}{2^l} \\
&< 2b,
\end{aligned}$$

we have also established (24b) for $\nu = k$. But then $x^{(k+1)}$ is in the $2b$-sphere
about $x^{(0)}$ and so (25) is valid with $\nu = k$. This gives

$$\begin{aligned}
\|A_{k+1}\| &= \|J_k^{-1}(J_k - J_{k+1})\| \\
&\le \|J_k^{-1}\| \cdot \|J_k - J_{k+1}\| \\
&\le c \|J_k^{-1}\| \cdot \|x^{(k+1)} - x^{(k)}\| \\
&\le abc \\
&\le \tfrac{1}{2}.
\end{aligned}$$

Thus (24c) is valid with $\nu = k$ and implies that $J_{k+1}$ is non-singular.
Then using (26) with $\nu = k$ yields (24d), and the inductive proof of (24)
is complete.

Part (i) of the theorem follows from (24b, d). The convergence of the
$x^{(\nu)}$ follows from (24a) since they form a Cauchy sequence: i.e.,

(30)
$$\begin{aligned}
\|x^{(\nu+m)} - x^{(\nu)}\| &= \left\| \sum_{l=\nu}^{\nu+m-1} (x^{(l+1)} - x^{(l)}) \right\| \\
&\le \sum_{l=\nu}^{\nu+m-1} \|x^{(l+1)} - x^{(l)}\| \le b \sum_{l=\nu}^{\nu+m-1} \frac{1}{2^l} \\
&< \frac{b}{2^{\nu-1}}.
\end{aligned}$$

Calling the limit vector $\alpha$, we use (24a), (28) and the continuity of $f(x)$
to deduce that

$\|f(x^{(k)})\| \le \frac{2b^2c}{4^k}$,

and $\lim_{k \to \infty} f(x^{(k)}) = f(\alpha) = o$. Letting $m \to \infty$, (30) implies

$\|\alpha - x^{(\nu)}\| \le \frac{b}{2^{\nu-1}}$,

and so, part (ii) is established, concluding the proof of the theorem. ■
This theorem is valid if $n = 1$. The hypothesis permits the case that
$J(\alpha)$ is singular. Hence, it is reasonable that the conclusion (ii) shows at
most linear convergence. But (ii), moreover, implies that the convergence
factor is at most $\frac{1}{2}$. This seems to contradict the fact shown earlier for the
scalar case ($n = 1$), that the convergence factor is only $1 - 1/p$ if $\alpha$ is a
zero of $f$ of order $p > 1$. The contradiction doesn’t exist because, as we show
in Problem 6, the requirement $abc \le \frac{1}{2}$ can only be satisfied if $p \le 2$.
[For example, $f(x) \equiv x^p$ and $x_0 \ne 0$ satisfy

$abc \ge \left| \frac{1}{f'(x_0)} \right| \cdot \left| \frac{f(x_0)}{f'(x_0)} \right| \cdot |f''(x_0)| = 1 - \frac{1}{p}$.]

On the other hand, if $h = abc < \frac{1}{2}$ then Kantorovich has shown in
general that

$\|x^{(\nu)} - \alpha\| \le (2h)^{2^\nu - 1} \frac{b}{2^{\nu-1}}$,

i.e., quadratic convergence. See Problems 3, 4, 5, and 6 for further results
in the scalar case.
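The hypotheses of Theorem 3 can be checked without knowing the root. The sketch below does this in the scalar case $n = 1$ for a function chosen purely for illustration ($f(x) = x^2 - 2$ with starting value 1.5, neither of which comes from the text) and then compares the Newton iterates with the error bound $b/2^{\nu-1}$ of conclusion (ii).

```python
import math

def f(x):
    return x * x - 2.0      # hypothetical example with root sqrt(2)

def df(x):
    return 2.0 * x

x0 = 1.5
a = 1.0 / abs(df(x0))       # bound on |1 / f'(x0)|, as in (22a)
b = abs(f(x0) / df(x0))     # length of the first Newton step, as in (22b)
c = 2.0                     # |f''(x)| = 2 everywhere, as in (22c) with n = 1
h = a * b * c               # the theorem requires h <= 1/2

alpha = math.sqrt(2.0)      # used only to check the bound afterwards
x = x0
bounds_hold = True
for nu in range(6):
    bound = b / 2.0 ** (nu - 1)
    bounds_hold = bounds_hold and abs(x - alpha) <= bound
    x = x - f(x) / df(x)    # next Newton iterate

print(h, bounds_hold, abs(x - alpha))
```

In this example $h = 1/18$, well below $\frac{1}{2}$, and the observed errors shrink much faster than the guaranteed factor of $\frac{1}{2}$, in line with the Kantorovich estimate quoted above.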
3.3. A Special Acceleration Procedure for Non-Linear Systems

If the non-linear system to be solved is written in the form

(31) $x = g(x)$,

then the obviously implied iterations

(32) $x^{(0)}$ = arbitrary; $x^{(\nu+1)} = g(x^{(\nu)}), \quad \nu = 0, 1, \ldots$,

may or may not converge. However, as with linear systems, we can alter
the procedure (32) in a manner which will generally improve the rate of
convergence or may even yield a convergent scheme when the basic one
diverges. The acceleration procedure is defined by:

(33a) $x^{(0)}$ = arbitrary; $x^{(\nu+1)} = \Theta g(x^{(\nu)}) + (I - \Theta) x^{(\nu)}, \quad \nu = 0, 1, \ldots$,

where $\Theta$ is a diagonal matrix given by

(33b) $\Theta \equiv (\theta_i \delta_{ij}), \quad \det[\Theta] = \theta_1 \theta_2 \cdots \theta_n \ne 0$.

Of course, if $\Theta \equiv I$, then the basic scheme (32) results. The scalar form of
the $i$th equation in (33a) is clearly

$x_i^{(\nu+1)} = \theta_i g_i(x^{(\nu)}) + (1 - \theta_i) x_i^{(\nu)}, \quad 1 \le i \le n$,

and so the iterations are easily evaluated in an explicit manner.

Let us assume that (31) has a solution, say $x = \alpha$, and that in some
$\rho$-sphere about this solution, $\|x - \alpha\| \le \rho$, the vector function $g$ has
continuous first partial derivatives, $g_{ij}(x) \equiv \partial g_i(x)/\partial x_j$, which satisfy the
conditions

(34) $|1 - g_{ii}(x)| > \sum_{j \ne i} |g_{ij}(x)|, \quad 1 \le i \le n$.

Under these conditions it can be shown that the iterations (33) will
converge, for some choice of the $\theta_i$, to a solution of (31) for any initial guess
$x^{(0)}$ in $\|x^{(0)} - \alpha\| \le \rho$. In fact, under slightly different assumptions we
could even demonstrate the existence of a solution if $\|g(x^{(0)}) - x^{(0)}\|$
is sufficiently small. However, we shall not present such specific theorems
but instead shall indicate the relevant arguments and concern ourselves
with the determination of the appropriate $\theta_i$ to be used in (33). These
considerations, in turn, suggest a modification in which the $\theta_i$ are changed
at each iteration.

If the error vector after the $\nu$th iteration is denoted by

$e^{(\nu)} \equiv x^{(\nu)} - \alpha$

then from (33a):

$e^{(\nu+1)} = \Theta[g(x^{(\nu)}) - g(\alpha)] + (I - \Theta) e^{(\nu)}$.

However, by Taylor’s theorem we then have

(35) $e^{(\nu+1)} = (I - \Theta + \Theta G_\nu) e^{(\nu)} \equiv M_\nu e^{(\nu)}$

where $G_\nu \equiv (g_{ij})$ and the $i$th row of $G_\nu$ is evaluated at some point
$\xi^{(i,\nu)} = x^{(\nu)} - \phi_i(x^{(\nu)} - \alpha)$, $0 < \phi_i < 1$, for $i = 1, 2, \ldots, n$. Clearly, the iterations
will converge if the coefficient matrices in (35) satisfy $\|M_\nu\| \le q < 1$ for
all $\nu$. Now if we use the maximum vector norm and corresponding induced
matrix norm, these inequalities are satisfied if

(36) $R_i(\theta_i) \equiv |1 - \theta_i[1 - g_{ii}(\xi^{(i,\nu)})]| + |\theta_i| \sum_{j \ne i} |g_{ij}(\xi^{(i,\nu)})| \le q < 1$,

$1 \le i \le n; \quad \nu = 0, 1, 2, \ldots$.
Thus, we are led to consider inequalities of the form

$R(\theta) \equiv |1 - \theta a| + |\theta| b < 1, \quad b \ge 0$.

It is easily shown that $R(\theta) < 1$ if $|a| > b$ and $\theta$ is in the interval:

(37a) $0 < \theta < \frac{2}{a + b}$ if $a > b$;

(37b) $\frac{2}{a - b} < \theta < 0$ if $-a > b$.

Furthermore, in each of these intervals the minimum value of $R(\theta)$ is
attained at

(37c) $\theta^* = \frac{1}{a}$ and $R^* \equiv R(\theta^*) = \frac{b}{|a|}$.

These results are easily deduced by considering the graphs of $|\theta| b$ and
$|1 - \theta a|$ as functions of $\theta$. It is also clear from such graphs that
underestimates of $|\theta^*|$ produce a smaller $R$ than do overestimates (see Figure
1 for the case $1 > a > b > 0$).



Figure 1 
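The statements about $R(\theta)$ are easy to confirm numerically. The following sketch samples $R$ on the interval (37a) for one illustrative pair $a > b > 0$ (the values 0.8 and 0.3 are assumptions made for the example) and compares an underestimate and an overestimate of $\theta^*$.

```python
def R(theta, a, b):
    """The quantity |1 - theta*a| + |theta|*b discussed around (37)."""
    return abs(1.0 - theta * a) + abs(theta) * b

a, b = 0.8, 0.3                      # illustrative values with 1 > a > b > 0
theta_star = 1.0 / a                 # minimizer stated in (37c)
upper = 2.0 / (a + b)                # right end point of the interval (37a)

# Sample the open interval (0, 2/(a+b)) and confirm R < 1 throughout.
samples = [upper * k / 1000.0 for k in range(1, 1000)]
all_below_one = all(R(t, a, b) < 1.0 for t in samples)

# No sampled value should beat theta_star (up to rounding).
best_sampled = min(R(t, a, b) for t in samples)

under = R(0.9 * theta_star, a, b)    # 10 percent underestimate of theta*
over = R(1.1 * theta_star, a, b)     # 10 percent overestimate of theta*
print(R(theta_star, a, b), b / a, under, over)
```

For these values $R(\theta^*) = b/a = 0.375$, the underestimate gives about 0.4375 and the overestimate about 0.5125, which matches the remark that erring on the small side is the safer choice.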
Employing (37) in (36) we find that: The scheme (33) converges if (34)
is satisfied and the acceleration parameters $\theta_i$ lie in the intervals

(38a) $0 < \theta_i < \frac{2}{1 - g_{ii}(\xi) + r_i(\xi)}$, if $1 - g_{ii}(\xi) > r_i(\xi)$;

(38b) $\frac{2}{1 - g_{ii}(\xi) - r_i(\xi)} < \theta_i < 0$, if $1 - g_{ii}(\xi) < -r_i(\xi)$;

where $r_i(\xi) \equiv \sum_{j \ne i} |g_{ij}(\xi)|$. [It should be noted that by the assumed
continuity of the $g_{ij}(x)$ and (34), if $r_i(x) \ne 0$ then only one of the ranges
(38) can apply for each $i = 1, 2, \ldots, n$.]

From a graph of any $R_i(\theta_i)$, it is clear that the value of $\theta_i$ should not
lie near the end points of the intervals in (38). In fact, from (37c) it is
suggested that $\theta_i = 1/[1 - g_{ii}(\xi)]$ is the “best” value for $\theta_i$. However,
since this value depends upon $\xi$, a safe choice, $\theta_i^*$, which can be rigorously
justified is that for which (38) holds

(39) $|\theta_i^*| = \min_{\|\xi - \alpha\| \le \rho} \left| \frac{1}{1 - g_{ii}(\xi)} \right|$.

These considerations suggest a modification of the scheme (33) in which an
approximation to the best $\theta_i$ is used at each step of the iteration. In fact,
if $x^{(\nu)}$ is close to a solution then the values

(40) $\theta_i^{(\nu)} = \frac{1}{1 - g_{ii}(x^{(\nu)})}, \quad i = 1, 2, \ldots, n$,

can be used in (33), to replace the constants $\theta_i$, and this practical scheme is
of the form:

(41) $x^{(\nu+1)} = \Theta^{(\nu)} g(x^{(\nu)}) + (I - \Theta^{(\nu)}) x^{(\nu)}, \quad \nu = 0, 1, \ldots$.

In carrying out this procedure, we need only evaluate the $n$ partial
derivatives, $g_{ii} = \partial g_i/\partial x_i$, at each step to predict the appropriate $\theta_i^{(\nu)}$. If these
derivatives are not easily obtained or evaluated, one may frequently use
difference approximations which just require one extra evaluation of
each of the functions $g_i(x)$, i.e., since $g_i(x^{(\nu)})$ is known we use

$g_{ii}(x^{(\nu)}) \approx \frac{g_i(x_1^{(\nu)}, \ldots, x_{i-1}^{(\nu)}, x_i^{(\nu)} + h, x_{i+1}^{(\nu)}, \ldots, x_n^{(\nu)}) - g_i(x^{(\nu)})}{h}$

with some suitably small value of $h$.

Although the conditions (34) seem severe and perhaps unusual they are
frequently satisfied in practice. In fact, many difference methods for solving
non-linear boundary value problems in ordinary and partial differential
equations result in such systems. For many of these systems, convergence
can even be obtained for some choice $\theta_1 = \theta_2 = \cdots = \theta_n = \theta$ (for example,
see a related scheme in Chapter 8, Subsection 7.2).
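The practical scheme (41), with the diagonal derivatives replaced by difference quotients, might be sketched as follows. The map `g`, the step `h` and the stopping rule are illustrative assumptions. The example is built so that the diagonal derivatives are $-2$ and $-3$: the basic iteration (32) is then repelled from the solution, while the conditions (34) hold because $3 > 0.5$ and $4 > 0.5$.

```python
import math

def g(x):
    # Hypothetical map with g_11 = -2, g_22 = -3 and small off-diagonal terms.
    return [-2.0 * x[0] + 0.5 * math.sin(x[1]) + 1.0,
            0.5 * math.cos(x[0]) - 3.0 * x[1] + 2.0]

def accelerated_iteration(g, x0, h=1e-6, tol=1e-12, max_iter=100):
    """Scheme (41): theta_i is recomputed at every step from a difference quotient."""
    x = list(x0)
    n = len(x)
    for k in range(1, max_iter + 1):
        gx = g(x)
        x_new = []
        for i in range(n):
            shifted = list(x)
            shifted[i] += h
            # For brevity the whole vector g is re-evaluated; only its
            # i-th component is actually needed here.
            g_ii = (g(shifted)[i] - gx[i]) / h
            theta = 1.0 / (1.0 - g_ii)               # parameter (40)
            x_new.append(theta * gx[i] + (1.0 - theta) * x[i])
        if max(abs(p - q) for p, q in zip(x_new, x)) <= tol:
            return x_new, k
        x = x_new
    raise RuntimeError("iteration did not converge")

root, iterations = accelerated_iteration(g, [0.0, 0.0])
residual = max(abs(p - q) for p, q in zip(root, g(root)))
print(root, iterations, residual)
```

Since the fixed points of (41) do not depend on the $\theta_i$, errors in the difference quotients affect only the speed of convergence, not the limit.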
PROBLEMS, SECTION 3

1. State and prove a generalization of Theorem 1.3 for systems of equations.

2. State and prove a version of Theorem 2 which employs a different norm
(say $\|\cdot\|_2$ or $\|\cdot\|_1$).

3. For $n = 1$, use the hypothesis of Theorem 3, with $h = abc \le \frac{1}{2}$ and show
directly that there exists a root $\alpha$ with $|\alpha - x_0| \le 2b$.

[Hint: Use Taylor’s theorem,

$f(x) = f(x_0) + (x - x_0) f'(x_0) + \frac{(x - x_0)^2}{2} f''(\xi)$

to show that

$\frac{f(x_0 - 2b)}{f'(x_0)} \cdot \frac{f(x_0 + 2b)}{f'(x_0)} \le 0$.]

4. Under the same assumptions as in Problem 3, show that the root $\alpha$
with $|\alpha - x_0| \le 2b$ is unique.

[Hint: If not unique, there exists $\eta$ with $|\eta - x_0| < 2b$ and $f'(\eta) = 0$.
But from Taylor’s formula

$f'(\eta) = f'(x_0) + (\eta - x_0) f''(\xi)$

show that $f'(\eta) \ne 0$.]

5. Under assumptions of Problem 3, with $h < \frac{1}{2}$, show that $f'(\alpha) \ne 0$
(use hint of Problem 4).

6. Under the assumptions of Problem 3, if $f'(\alpha) = 0$, show that $|\alpha - x_0| = 2b$
(by hint of Problem 4). Furthermore, if $f'(\alpha) = 0$, show that $f''(\alpha) \ne 0$.

[Hint: $f''(\alpha) = \lim_{\eta \to \alpha} f'(\eta)/(\eta - \alpha)$, but by hint of Problem 4 and $\eta < \alpha$,

$|f'(\eta)| \ge |f'(x_0)| \left( 1 - \frac{\eta - x_0}{2b} \right) = |f'(x_0)| \frac{\alpha - \eta}{2b}$.]


4. SPECIAL METHODS FOR POLYNOMIALS 

All of the previous schemes for single equations can be employed to 
compute the roots of polynomials. Complex roots can be obtained by 
simply using complex arithmetic and complex initial estimates. Or, by 
reducing the evaluation of a polynomial at a complex point to its real and 
imaginary parts, the iterative methods for real systems (of order two in this 
case) could also be used to obtain the complex roots of polynomials. 
However, it is possible to devise special iterative methods which are 
frequently more advantageous than the general methods. We shall consider 
some of these polynomial methods in this section. 

It is of interest to note, first, that a very simple a posteriori test of the
accuracy of an approximate root of a polynomial is frequently quite
effective. Let the $n$th degree polynomial $P_n(x)$ be

(1) $P_n(x) \equiv a_0 x^n + a_1 x^{n-1} + \cdots + a_{n-1} x + a_n$

with roots $r_1, r_2, \ldots, r_n$. Then since

(2) $P_n(x) = a_0 (x - r_1)(x - r_2) \cdots (x - r_n)$

we have the well-known result that

(3) $\frac{a_n}{a_0} = (-1)^n r_1 r_2 \cdots r_n$.

Now let $\alpha$ be an approximate root of $P_n(x)$ which satisfies the test

(4) $|P_n(\alpha)| \le \epsilon$.

Then from (2) and (3) it follows that (we assume $a_n \ne 0$)

$\left| \frac{P_n(\alpha)}{a_n} \right| = \left| 1 - \frac{\alpha}{r_1} \right| \cdot \left| 1 - \frac{\alpha}{r_2} \right| \cdots \left| 1 - \frac{\alpha}{r_n} \right| \le \frac{\epsilon}{|a_n|}$.

Taking the $n$th root we now conclude, since

$\left| \frac{P_n(\alpha)}{a_n} \right|^{1/n} \ge \min_i \left| 1 - \frac{\alpha}{r_i} \right|$,

that

(5) $\min_i \left| \frac{\alpha - r_i}{r_i} \right| \le \left| \frac{\epsilon}{a_n} \right|^{1/n}$.

Thus we obtain an exact† bound on the relative error of $\alpha$ as an
approximation to some root of $P_n(x)$. In many of the methods to be studied, $P_n(\alpha)$
is already computed and so no extra calculations are required to employ
this test. Note that the roots and approximations in this test may be
real or complex.

4.1. Evaluation of Polynomials and Their Derivatives

An interesting special feature of polynomials is the ease with which
they may be evaluated. Let us write (1) as

(6) $P_n(x) \equiv a_0 x^n + a_1 x^{n-1} + \cdots + a_{n-1} x + a_n = \{ \cdots [(a_0 x + a_1) x + a_2] x + \cdots + a_{n-1} \} x + a_n$.

The usual way to evaluate $P_n(x)$ is by means of this “nesting” procedure.
More explicitly, to calculate $P_n(\xi)$ we form:

(7)
$$\begin{aligned}
b_0 &= a_0, \\
b_1 &= b_0 \xi + a_1, \\
b_2 &= b_1 \xi + a_2, \\
&\;\;\vdots \\
b_\nu &= b_{\nu-1} \xi + a_\nu, \quad \nu = 1, 2, \ldots, n;
\end{aligned}$$
and note that $b_n = P_n(\xi)$. These operations are just those employed in
the elementary process of synthetic division. In fact, if we write

$P_n(x) = (x - \xi) Q_{n-1}(x) + R_0$,

then clearly, $R_0 = P_n(\xi) = b_n$ and it is easily verified (by multiplying out
and equating coefficients of like powers of $x$) that, using the quantities
in (7):

(8) $Q_{n-1}(x) = b_0 x^{n-1} + b_1 x^{n-2} + \cdots + b_{n-1}$.

Dividing again by $(x - \xi)$ we get, say,

$Q_{n-1}(x) = (x - \xi) Q_{n-2}(x) + R_1$

and hence

$P_n(x) = (x - \xi)^2 Q_{n-2}(x) + (x - \xi) R_1 + R_0$.

Differentiating the last expression, we find that $R_1 = P_n'(\xi)$. Thus by
performing synthetic division of $Q_{n-1}(x)$ by $x - \xi$ we could determine
$Q_{n-2}(x)$ and $P_n'(\xi)$. Clearly this procedure can be continued to yield
finally

(9) $P_n(x) = R_n (x - \xi)^n + \cdots + R_1 (x - \xi) + R_0$.

The successive calculations to determine the $R_\nu$ and coefficients of the
intermediate polynomials $Q_\nu(x)$ can be indicated by the array in Table 1.


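Table 1

$$\begin{array}{c|ccccc}
 & 0 & 0 & \cdots & 0 & 0 \\ \hline
a_0 & b_0 & c_0 & \cdots & \cdot & R_n \\
a_1 & b_1 & c_1 & \cdots & R_{n-1} & \\
\vdots & \vdots & \vdots & & & \\
a_{n-2} & b_{n-2} & c_{n-2} & & & \\
a_{n-1} & b_{n-1} & R_1 & & & \\
a_n & R_0 & & & &
\end{array}$$

Any entry of Table 1 not in the first row or column (which are given
initially), is computed by multiplying the entry above by $\xi$ and adding
the entry to the left. It follows from (9) by differentiation that

$R_\nu = \frac{1}{\nu!} \left. \frac{d^\nu P_n(x)}{dx^\nu} \right|_{x = \xi}, \quad \nu = 0, 1, \ldots, n$.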
From (9) it also follows easily that the polynomial with coefficients $R_\nu$, say

$R(y) \equiv R_n y^n + R_{n-1} y^{n-1} + \cdots + R_1 y + R_0$,

has as roots all those of $P(x)$ reduced by the amount $\xi$.

Finally we note that the evaluation of the entire Table 1 requires
$\frac{1}{2}(n^2 + n)$ multiplications and as many additions since the evaluation of
the first column requires only $n$ of each operation. But in computing the
entries of the table, significant figures may be lost in any of the additions;
see eqs. (7). Often this necessitates the use of multiple precision arithmetic.

To employ Newton’s method on polynomial equations, only the first
two columns in Table 1 need be computed. This is easily accomplished
by means of the recursions (7) and a similar set with $a_\nu$, $b_\nu$ replaced by
$b_\nu$, $c_\nu$ for $\nu = 0, 1, \ldots, n - 1$. An interesting application of Newton’s
method to polynomials is found in Problem 1.
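The nesting recursion (7), the repeated synthetic division behind Table 1, and the a posteriori test (5) can be sketched together as follows. The function names and the cubic used as an example are assumptions made for illustration; coefficients are stored with $a_0$ first.

```python
def horner(coeffs, xi):
    """Evaluate P(xi) by the nesting recursion (7); coeffs = [a_0, ..., a_n]."""
    b = coeffs[0]
    for a in coeffs[1:]:
        b = b * xi + a
    return b

def synthetic_division_table(coeffs, xi):
    """Return [R_0, R_1, ..., R_n] of (9) by repeated synthetic division."""
    column = list(coeffs)
    remainders = []
    while column:
        new = [column[0]]
        for a in column[1:]:
            new.append(new[-1] * xi + a)
        remainders.append(new[-1])   # bottom entry of this column of Table 1
        column = new[:-1]            # coefficients of the next quotient
    return remainders

def relative_error_bound(coeffs, alpha):
    """Right-hand side of (5), taking epsilon = |P(alpha)|."""
    n = len(coeffs) - 1
    return abs(horner(coeffs, alpha) / coeffs[-1]) ** (1.0 / n)

p = [1, -6, 11, -6]                    # (x - 1)(x - 2)(x - 3), an example
R = synthetic_division_table(p, 2)     # expansion of P about xi = 2
alpha = 2.001                          # an approximate root
bound = relative_error_bound(p, alpha)
true_error = min(abs((r - alpha) / r) for r in (1, 2, 3))
print(R, bound, true_error)
```

Here `R` comes out as `[0, -1, 0, 1]`, that is $P(x) = (x - 2)^3 - (x - 2)$, so $P(2) = 0$ and $P'(2) = -1$. The bound (5) is valid but can be pessimistic: in this example it is about 0.055 while the true relative error is about 0.0005.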

4.2. Sturm Sequences 

It would be very desirable to obtain successive upper and lower bounds 
on the real roots of a polynomial equation or indeed of any equation, as 
then the error in approximating the root is easily estimated. This could be 
done if the number of real roots of the equation in any interval could be 
determined, and this can in fact be done by means of the so-called Sturm 
sequences which we proceed to define. Let the equation to be solved be 
$f_0(x) = 0$ where $f_0(x)$ is differentiable in $[a, b]$. Then the continuous
functions

$f_0(x), f_1(x), \ldots, f_m(x)$

form a Sturm sequence on $[a, b]$ if they satisfy there:

(i) $f_0(x)$ has at most simple roots in $[a, b]$;

(ii) $f_m(x)$ does not vanish in $[a, b]$;

(iii) if $f_\nu(\alpha) = 0$, then $f_{\nu-1}(\alpha) f_{\nu+1}(\alpha) < 0$ for any root $\alpha \in [a, b]$;

(iv) if $f_0(\alpha) = 0$, then $f_0'(\alpha) f_1(\alpha) > 0$ for any root $\alpha \in [a, b]$.

For every such sequence there follows Sturm’s

THEOREM 1. The number of zeros of $f_0(x)$ in $(a, b)$ is equal to the difference
between the number of sign variations in $\{f_0(a), f_1(a), \ldots, f_m(a)\}$ and in
$\{f_0(b), f_1(b), \ldots, f_m(b)\}$ provided that $f_0(a) f_0(b) \ne 0$ and $\{f_0(x), f_1(x), \ldots,
f_m(x)\}$ form a Sturm sequence on $[a, b]$.

Proof. The number of sign variations can change as $x$ goes from $a$
to $b$ only by means of some function changing sign in the interval. By (ii)
it cannot be $f_m(x)$. Assume that at some $\bar{x} \in (a, b)$, $f_\nu(\bar{x}) = 0$ for some $\nu$ in
$0 < \nu < m$. In a neighborhood of $\bar{x}$ the signs must be either

$$\begin{array}{c|ccc}
x & f_{\nu-1}(x) & f_\nu(x) & f_{\nu+1}(x) \\ \hline
\bar{x} - \epsilon & + & \pm & - \\
\bar{x} & + & 0 & - \\
\bar{x} + \epsilon & + & \pm & -
\end{array}
\qquad \text{or} \qquad
\begin{array}{c|ccc}
x & f_{\nu-1}(x) & f_\nu(x) & f_{\nu+1}(x) \\ \hline
\bar{x} - \epsilon & - & \pm & + \\
\bar{x} & - & 0 & + \\
\bar{x} + \epsilon & - & \pm & +
\end{array}$$

The signs in the row $x = \bar{x}$ follow from (iii). The signs in the first and last
columns then follow by continuity for sufficiently small $\epsilon$. The sign
possibilities in the middle columns are general. But we see from these tables
that as $x$ passes through $\bar{x}$, there are no changes in the number of sign
variations in the Sturm sequence. We now examine the signs near a zero
$x = \bar{x}$ of $f_0(x)$:

$$\begin{array}{c|cc}
x & f_0(x) & f_1(x) \\ \hline
\bar{x} - \epsilon & + & - \\
\bar{x} & 0 & - \\
\bar{x} + \epsilon & - & -
\end{array}
\qquad \text{or} \qquad
\begin{array}{c|cc}
x & f_0(x) & f_1(x) \\ \hline
\bar{x} - \epsilon & - & + \\
\bar{x} & 0 & + \\
\bar{x} + \epsilon & + & +
\end{array}$$

The $f_0(x)$ columns represent the two possible cases for a simple zero.
The sign of $f_1(\bar{x})$ then follows from (iv) and the continuity implies the
other signs in the last columns. Clearly, there is now a decrease of one
change in the number of sign variations as $x$ increases through a zero of
$f_0(x)$. When these results are combined the theorem follows. ■

It is easy to construct a Sturm sequence when $f_0(x) \equiv P_n(x)$ is a
polynomial, say of degree $n$. We define

$f_1(x) \equiv f_0'(x)$

so that (iv) is satisfied at simple roots. Divide $f_0(x)$ by $f_1(x)$ and call the
remainder $-f_2(x)$. Then divide $f_1(x)$ by $f_2(x)$ and call the remainder
$-f_3(x)$. Continue this procedure until it terminates [which it must since
each polynomial $f_\nu(x)$ is of lower degree than its predecessor, $f_{\nu-1}(x)$].
We thus have the array:

(10)
$$\begin{aligned}
f_0(x) &= q_1(x) f_1(x) - f_2(x), \\
f_1(x) &= q_2(x) f_2(x) - f_3(x), \\
&\;\;\vdots \\
f_{m-2}(x) &= q_{m-1}(x) f_{m-1}(x) - f_m(x), \\
f_{m-1}(x) &= q_m(x) f_m(x).
\end{aligned}$$

This procedure is well known as the Euclidean algorithm for determining
the highest common factor, $f_m(x)$, of $f_0(x)$ and $f_1(x)$. [It is easily seen that
$f_m(x)$ is a factor of all the $f_\nu(x)$, $\nu = 0, 1, \ldots, m - 1$; conversely any
common factor of $f_0(x)$ and $f_1(x)$ must be a factor of $f_\nu(x)$, $\nu = 2, 3, \ldots, m$.]
Thus all multiple roots of $f_0(x)$ are also roots of $f_m(x)$ with multiplicity
reduced by one. If $f_m(x)$ is not a constant (i.e., $f_0(x)$ has multiple roots)
then we may divide all the $f_\nu(x)$ by $f_m(x)$ and (denoting these quotients by
their numerators) obtain the sequence (10) in which $f_0(x)$ has only simple
roots and $f_m(x)$ is a constant. It now follows that this reduced sequence
$\{f_0(x), f_1(x), \ldots, f_m(x)\}$ is a Sturm sequence. Only (iii) requires proof: If
$f_\nu(x) = 0$ then, by (10), at this point $f_{\nu-1}(x) = -f_{\nu+1}(x)$; but if $f_{\nu-1}(x) = 0$,
then also $f_0(x) = f_1(x) = 0$, a contradiction. Simple formulas for
computing the coefficients of the $f_\nu(x)$ can be obtained from (10); we present
this as Problem 2.
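The construction (10) and the root count of Sturm’s theorem can be sketched in a few lines. This is an illustration only: the function names are hypothetical, exact rational arithmetic from the standard library is used to avoid sign errors from rounding, the polynomial is assumed to have degree at least one with only simple roots, and the end points are assumed not to be roots.

```python
from fractions import Fraction

def poly_eval(p, x):
    """Evaluate a polynomial given by coefficients, leading coefficient first."""
    v = Fraction(0)
    for c in p:
        v = v * x + c
    return v

def poly_derivative(p):
    n = len(p) - 1
    return [c * (n - i) for i, c in enumerate(p[:-1])]

def poly_remainder(num, den):
    """Remainder of num divided by den (polynomial long division)."""
    num = list(num)
    while len(num) >= len(den):
        factor = num[0] / den[0]
        for i, d in enumerate(den):
            num[i] -= factor * d
        num.pop(0)                      # leading coefficient is now zero
    while num and num[0] == 0:          # strip remaining leading zeros
        num.pop(0)
    return num

def sturm_sequence(p):
    """f_0 = p, f_1 = p', and f_{v+1} = -(remainder of f_{v-1} / f_v), as in (10)."""
    p = [Fraction(c) for c in p]
    seq = [p, poly_derivative(p)]
    while True:
        rem = poly_remainder(seq[-2], seq[-1])
        if not rem:
            return seq
        seq.append([-c for c in rem])

def sign_variations(seq, x):
    values = [v for v in (poly_eval(f, x) for f in seq) if v != 0]
    return sum(1 for s, t in zip(values, values[1:]) if (s > 0) != (t > 0))

def count_roots(p, a, b):
    """Number of real roots of p in (a, b), by Sturm's theorem."""
    seq = sturm_sequence(p)
    return sign_variations(seq, Fraction(a)) - sign_variations(seq, Fraction(b))

p = [1, -6, 11, -6]                     # example with roots 1, 2, 3
print(count_roots(p, 0, 4), count_roots(p, 1.5, 2.5), count_roots(p, 3.5, 5))
```

For the example cubic the sequence is $f_0$, $f_1 = 3x^2 - 12x + 11$, $f_2 = \frac{2}{3}x - \frac{4}{3}$, $f_3 = 1$, and the three calls report 3, 1 and 0 roots respectively.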

The usual way to employ a Sturm sequence is with successive bisections
of initially chosen intervals. In this way, with each new evaluation of the
sequence, the error in determining a real root is at least halved. Thus this
procedure converges with an asymptotic convergence factor of at least $\frac{1}{2}$.
The value of Sturm sequences is clearly not in rapid convergence properties
but rather in the ability to obtain good estimates of all real roots. When
the desired root or roots have been located it is more efficient to employ
a more rapidly converging iteration, say false position. Or, in fact, when a
root is known to lie in the given interval, $(a, b)$, because $f(a) f(b) < 0$,
we may simply calculate $f((a + b)/2)$. Now, $(a + b)/2$ may be a root. If
$(a + b)/2$ is not a root, then we would know from the sign of $f((a + b)/2)$
in which of the two sub-intervals, $(a, (a + b)/2)$ or $((a + b)/2, b)$, $f(x)$
does have a root. In this way, we may continue to bisect successive
sub-intervals in which we know $f(x)$ to have a root. This procedure is known
as the bisection method. It has the convergence factor $\frac{1}{2}$. Each step of the
bisection method requires fewer calculations than does the evaluation of
the Sturm sequence. At each step of the bisection method we have upper
and lower bounds for a real root.
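A minimal sketch of the bisection method just described is given below. The test function, the interval and the tolerance are assumptions chosen for illustration.

```python
def bisect(f, a, b, tol=1e-10):
    """Shrink [a, b] around a root of f, assuming f(a) * f(b) < 0."""
    fa, fb = f(a), f(b)
    if fa * fb >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    while b - a > tol:
        mid = 0.5 * (a + b)
        fm = f(mid)
        if fm == 0.0:
            return mid, mid            # the midpoint happens to be a root
        if fa * fm < 0:
            b, fb = mid, fm            # root lies in the left half
        else:
            a, fa = mid, fm            # root lies in the right half
    return a, b

def cubic(x):
    return x ** 3 - 2.0 * x - 5.0      # hypothetical example; root near 2.0946

lower, upper = bisect(cubic, 2.0, 3.0)
print(lower, upper)
```

The returned pair gives lower and upper bounds for the root, and each pass through the loop halves their distance.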

4.3. Bernoulli’s Method 

Consider the polynomial equation

(11) $P(x) \equiv x^n + a_1 x^{n-1} + \cdots + a_{n-1} x + a_n = 0$

and assume that its roots, $r_j$, are distinct and ordered by

(12) $|r_1| > |r_2| > \cdots > |r_n|$.

This equation and its roots bear an important relationship to the difference
equation or recursion (see Chapter 8, Section 4):

(13a) $g_\nu = -(a_1 g_{\nu-1} + a_2 g_{\nu-2} + \cdots + a_n g_{\nu-n}), \quad \nu = n, n + 1, \ldots$

(13b) $g_0 = c_0, \ldots, g_{n-1} = c_{n-1}$.

Given the $n$ starting values, $c_j$, the $g_\nu$ are easily evaluated (with $n$
multiplications and $n - 1$ additions per step). By seeking a formal solution of
(13a) of the form $g_\nu = r^\nu$ we find that any root of (11) yields a solution.
Furthermore, since the difference equations are linear it easily follows
that any linear combination of the powers of the roots of (11) is also a
formal solution of (13a): thus we may write:

(14) $g_\nu = b_1 r_1^\nu + b_2 r_2^\nu + \cdots + b_n r_n^\nu; \quad \nu = 0, 1, 2, \ldots$.

The polynomial (11) is called the characteristic equation of the difference
equation (13a). The conditions (13b) yield a linear system of $n$ equations
for the determination of the $b_j$. The determinant of the coefficient matrix
of this system is a Vandermonde determinant which, by (12), does not
vanish [see (2.4) in Chapter 5]. It can also be shown that (14) is the most
general solution of (13a), and hence with the $b_j$ as determined above, (14)
is the unique solution of (13a, b) (see Problem 3).

For v » 1 we have, recalling (12), g v z and hence we are led to 

consider the sequence 



Here we have assumed that Z?i ^ 0 and, in this case, clearly lim o v = r lt 

V—► 00 

The rapidity of the approach is determined by the ratio |r 2 /^i I • If this 
ratio is not near unity, r 1 is easily obtained and can be eliminated from the 
original polynomial by synthetic division as in Subsection 4.1 and then a 
new recursion is evaluated to determine r 2 , etc. (In practice, elimination 
of the roots in decreasing order of magnitude may produce considerably 
larger errors than elimination in increasing order of magnitude. See 
Problem 6.) 
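
A minimal Python sketch of the recursion (13a) and of the ratios $g_{\nu+1}/g_\nu$ may make the procedure concrete. The function name `bernoulli_ratios`, the starting values (0, ..., 0, 1), and the sample cubic (x - 1)(x - 2)(x - 4) are assumptions chosen for illustration; no rescaling of the $g_\nu$ is attempted, so very long runs could overflow.

```python
def bernoulli_ratios(a, steps=60):
    """Ratios g_{v+1}/g_v for x^n + a[0] x^(n-1) + ... + a[n-1] = 0.

    The list a holds a_1, ..., a_n of (11).  Starting values (13b) are
    taken as c_0 = ... = c_{n-2} = 0, c_{n-1} = 1 (an assumed choice).
    """
    n = len(a)
    g = [0.0] * (n - 1) + [1.0]          # the n most recent values of g
    sigma = []
    for _ in range(steps):
        # Recursion (13a): g_v = -(a_1 g_{v-1} + ... + a_n g_{v-n}).
        g_new = -sum(a[k] * g[n - 1 - k] for k in range(n))
        if g[-1] != 0.0:
            sigma.append(g_new / g[-1])
        g = g[1:] + [g_new]
    return sigma


# (x - 1)(x - 2)(x - 4) = x^3 - 7x^2 + 14x - 8; the dominant root is 4.
sigma = bernoulli_ratios([-7.0, 14.0, -8.0])
```

For this example the ratio $|r_2/r_1|$ is 1/2, so each step roughly halves the error in the last entry of `sigma`.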

If $b_1 = 0$ for an unfortunate choice of starting values (13b), it would 
seem that the ratios then converge to $r_2$. This is theoretically correct but 
for actual computations the roundoff introduced in the successive evaluations 
of (13a) has the effect of altering the exact $b_j$ which should occur in 
(14). Thus after a few recursions there will be some perhaps small but 
non-zero component $b_1$ present and in subsequent steps the dominant root $r_1$ 
may still be determined. In like manner, if some error or blunder is 
committed in the course of these Bernoulli iterations the subsequent steps, 
performed correctly, will obliterate the error and yield the correct result. 

The expansion in (15) shows that the convergence of $\sigma_\nu$ to $r_1$ is 
geometric with ratio $|r_2/r_1|$. To test the convergence of the sequence $\{\sigma_\nu\}$, 
a number of devices can be employed. Perhaps the most frequently used 
procedure in practice is to compute the differences $|\sigma_{\nu+1} - \sigma_\nu|$ and to 
stop when these are less than some predetermined small quantity. Another 
possibility which yields more precise information at the cost of some extra 
computation is to use the test (4). Of course, if sufficiently many steps 
have been performed, $a = \sigma_\nu$ is closest to $r_1$, the dominant root, and so 
j = 1 in (5). Test (4) should not be made after every iteration but, say, 
every several steps to reduce the computational effort. 

The conditions (12) are most likely not satisfied for a polynomial since 
in general some conjugate complex roots are to be expected. Suppose then 
that the dominant real roots have been eliminated and (11) is the reduced 
equation with $r_1$ and $r_2$ conjugate complex roots, $r_2 = \bar{r}_1$, satisfying 

(16)    $$|r_1| = |r_2| > |r_3| > \cdots > |r_n|.$$

The solution of (13) is again of the form (14) where the $b_j$ may be complex 
and while (15) is valid the $\sigma_\nu$ do not converge to $r_1$ since $|r_2/r_1| = 1$. 
For large $\nu$ we now expect that 

$$g_\nu \approx b_1 r_1^\nu + b_2 r_2^\nu.$$

Here we note that $r_2 = \bar{r}_1$ and $b_2 = \bar{b}_1$ since the $g_\nu$ must be real (assuming 
that the $a_j$ and $c_j$ are real). A simple calculation now reveals that 

$$A_\nu \equiv g_{\nu+1} g_{\nu-1} - g_\nu^2 \approx |b_1|^2 (r_1 - r_2)^2 (r_1 r_2)^{\nu-1},$$

$$B_\nu \equiv g_{\nu+2} g_{\nu-1} - g_{\nu+1} g_\nu \approx |b_1|^2 (r_1 - r_2)^2 (r_1 r_2)^{\nu-1} (r_1 + r_2).$$

Thus we expect that with $s_\nu \equiv B_\nu / A_\nu$ and $p_\nu \equiv A_{\nu+1} / A_\nu$: 

$$\lim_{\nu \to \infty} s_\nu = r_1 + r_2 \equiv s, \qquad \lim_{\nu \to \infty} p_\nu = r_1 r_2 \equiv p,$$

and the roots $r_1$ and $r_2$ are those of the quadratic equation 

$$\xi^2 - s\xi + p = 0.$$

Recalling (16) we can now estimate the error in this procedure for 
complex roots. Let $|r_1| = |r_2| = r$, $|r_3|/r = \delta$ and we find as above: 

$$g_\nu = b_1 r_1^\nu + b_2 r_2^\nu + \mathcal{O}(r^\nu \delta^\nu),$$

$$A_\nu = |b_1|^2 (r_1 - r_2)^2 (r_1 r_2)^{\nu-1} + \mathcal{O}(r^{2\nu} \delta^\nu),$$

$$B_\nu = |b_1|^2 (r_1 - r_2)^2 (r_1 r_2)^{\nu-1} (r_1 + r_2) + \mathcal{O}(r^{2\nu} \delta^\nu).$$

Thus by the usual expansions 

(17)    $$s_\nu = s + \mathcal{O}(\delta^\nu), \qquad p_\nu = p + \mathcal{O}(\delta^\nu).$$

The equation actually solved is the quadratic 

(18)    $$\xi^2 - s_\nu \xi + p_\nu = 0.$$

It is easily shown, since $s^2 - 4p \neq 0$, that the roots of this equation 
differ from those for $\nu = \infty$ by terms $\mathcal{O}(\delta^\nu)$. 

If the dominant roots are equal but of opposite sign, then the above 
procedure for complex roots is still applicable where now $r_2 = -r_1$. If 
the dominant root is real but multiple, then the original procedure is 
applicable but converges more slowly (see Problem 5). 
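
The quantities $A_\nu$, $B_\nu$, $s_\nu$ and $p_\nu$ for a dominant complex pair can be computed as in the following Python sketch. The function name `bernoulli_quadratic`, the starting values (0, ..., 0, 1), the number of steps, and the sample cubic $(x^2 - 2x + 5)(x - 1)$ are assumptions for illustration only.

```python
import cmath


def bernoulli_quadratic(a, steps=40):
    """Return (s_v, p_v) for x^n + a[0] x^(n-1) + ... + a[n-1] = 0.

    Uses A_v = g_{v+1} g_{v-1} - g_v^2 and
         B_v = g_{v+2} g_{v-1} - g_{v+1} g_v,
    with s_v = B_v / A_v and p_v = A_{v+1} / A_v.
    """
    n = len(a)
    g = [0.0] * (n - 1) + [1.0]               # assumed starting values (13b)
    for _ in range(steps + 3):
        # Recursion (13a); g[-1 - k] is g_{v-1-k} before the append.
        g.append(-sum(a[k] * g[-1 - k] for k in range(n)))
    v = len(g) - 3                             # largest v with g[v + 2] available
    A_v = g[v + 1] * g[v - 1] - g[v] ** 2
    A_next = g[v + 2] * g[v] - g[v + 1] ** 2
    B_v = g[v + 2] * g[v - 1] - g[v + 1] * g[v]
    return B_v / A_v, A_next / A_v


# (x^2 - 2x + 5)(x - 1) = x^3 - 3x^2 + 7x - 5: dominant roots 1 +/- 2i.
s, p = bernoulli_quadratic([-3.0, 7.0, -5.0])

# Roots of the quadratic xi^2 - s xi + p = 0, as in (18).
disc = cmath.sqrt(s * s - 4.0 * p)
r1, r2 = (s + disc) / 2.0, (s - disc) / 2.0
```

Here $\delta = |r_3|/r = 1/\sqrt{5}$, so by (17) the errors in `s` and `p` decay roughly like $\delta^\nu$.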

4.4. Bairstow’s Method 

A much better scheme (than Bernoulli’s) for determining quadratic 
factors of a polynomial, $P_n(x)$, is based on a generalization of synthetic 
division and Newton’s method for systems of equations. This procedure 
also avoids the use of complex arithmetic. In brief, we seek real numbers, 
say s and p, such that the quadratic 

(19)    $$x^2 + sx + p$$

is an exact factor of $P_n(x)$. The division of $P_n(x)$ by this factor may be 
indicated as: 

(20)    $$P_n(x) = (x^2 + sx + p)\, Q_{n-2}(x) + [x R_1(s, p) + R_0(s, p)].$$

Here, $R_1$ and $R_0$ are the coefficients in the remainder which is at most 
linear in x. As indicated, these coefficients are functions of s and p, the 
parameters in the quadratic (19). In order that the remainder be zero, s 
and p must satisfy 

(21)    $$R_1(s, p) = 0, \qquad R_0(s, p) = 0.$$

This is a system of two equations in two unknowns which we solve by 
Newton’s method. For this purpose we must evaluate the four derivatives 

(22)    $$\frac{\partial R_1}{\partial s}, \quad \frac{\partial R_0}{\partial s}, \quad \frac{\partial R_1}{\partial p}, \quad \frac{\partial R_0}{\partial p}.$$

We determine the quantities in (22) indirectly. Let $Q_{n-2}(x)$ in (20) be 
divided by the quadratic (19) to yield 

(23)    $$Q_{n-2}(x) = (x^2 + sx + p)\, Q_{n-4}(x) + [x R_3(s, p) + R_2(s, p)].$$

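We note that the specific values of the $R_j(s, p)$ in (20) and (23) for any 
fixed (s, p) are easily obtained by carrying out the two indicated divisions. 
Using (23) in (20) we have the identity: 

(24)
$$
\begin{aligned}
P_n(x) = {} & (x^2 + sx + p)^2\, Q_{n-4}(x) \\
& + (x^2 + sx + p)\,[x R_3(s, p) + R_2(s, p)] \\
& + [x R_1(s, p) + R_0(s, p)].
\end{aligned}
$$

Since $P_n(x)$ is independent of s and p, we can differentiate (24) with respect 
to s and p and evaluate the result at a root x = z of $z^2 + sz + p = 0$ 
to get: 

$$z(zR_3 + R_2) + \left( z \frac{\partial R_1}{\partial s} + \frac{\partial R_0}{\partial s} \right) = 0,$$

$$(zR_3 + R_2) + \left( z \frac{\partial R_1}{\partial p} + \frac{\partial R_0}{\partial p} \right) = 0.$$

Since $z^2 = -(sz + p)$ these equations can be written as 

$$z \left( \frac{\partial R_1}{\partial s} - sR_3 + R_2 \right) + \left( \frac{\partial R_0}{\partial s} - pR_3 \right) = 0,$$

$$z \left( \frac{\partial R_1}{\partial p} + R_3 \right) + \left( \frac{\partial R_0}{\partial p} + R_2 \right) = 0.$$

If the quadratic is not a perfect square (in which case the roots would 
be real and equal), the above must hold for two distinct roots, z, and hence 
each coefficient in parentheses must vanish. Thus we deduce that: 

(25)
$$
\frac{\partial R_1}{\partial s} = sR_3 - R_2, \qquad \frac{\partial R_0}{\partial s} = pR_3; \qquad
\frac{\partial R_1}{\partial p} = -R_3, \qquad \frac{\partial R_0}{\partial p} = -R_2.
$$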

The iteration scheme proceeds from an initial estimate $(s_0, p_0)$ which is 
such that $p_0 \neq s_0^2/4$ and defines recursively the sequence $(s_\nu, p_\nu)$ by 
Newton’s method of solving (21); i.e., 

(26)
$$
\begin{aligned}
(s_{\nu+1} - s_\nu) \frac{\partial R_1(s_\nu, p_\nu)}{\partial s} + (p_{\nu+1} - p_\nu) \frac{\partial R_1(s_\nu, p_\nu)}{\partial p} &= -R_1(s_\nu, p_\nu), \\
(s_{\nu+1} - s_\nu) \frac{\partial R_0(s_\nu, p_\nu)}{\partial s} + (p_{\nu+1} - p_\nu) \frac{\partial R_0(s_\nu, p_\nu)}{\partial p} &= -R_0(s_\nu, p_\nu).
\end{aligned}
$$

The coefficients and inhomogeneous terms in (26) are obtained by carrying 
out the two divisions indicated in (20) and (23) with the quadratic factor 
$x^2 + s_\nu x + p_\nu$ and then using (25). The divisions can be reduced to the 
evaluation of simple recursions by equating the coefficients on both sides 
of the equality signs in (20) and (23); we leave the derivation of this 
generalized synthetic division as an exercise. During the course of the 
iterations it should be checked that $p_\nu \neq s_\nu^2/4$, so that (25) is valid. To 
test the convergence we may employ (4) and (5), after noting that if z 
is a root of $x^2 + sx + p = 0$ then, by (20), 

$$P_n(z) = z R_1(s, p) + R_0(s, p).$$

Usually the coefficients $(s_\nu, p_\nu)$ converge quickly, and so a direct test on 
the roots need only be applied when the iterations are about to be 
terminated. 
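
The scheme (20), (23), (25), (26) can be sketched in Python as follows. This is a hedged illustration, not a definitive implementation: the function names, the stopping rule on the step size, the starting guess, and the sample quartic $(x^2 - 2x + 5)(x^2 + 3x + 2)$ are all assumptions, and the synthetic-division recursion is the one obtained by equating coefficients in (20).

```python
def divide_by_quadratic(c, s, p):
    """Divide c[0] x^n + ... + c[n] by x^2 + s x + p.

    Returns (q, R1, R0): quotient coefficients and the remainder x*R1 + R0.
    """
    n = len(c) - 1
    if n < 2:                                  # degree < 2: quotient is zero
        padded = [0.0] * (1 - n) + list(c)
        return [], padded[0], padded[1]
    q = []
    for k in range(n - 1):
        qk = c[k]
        if k >= 1:
            qk -= s * q[k - 1]
        if k >= 2:
            qk -= p * q[k - 2]
        q.append(qk)
    q_prev = q[-2] if len(q) >= 2 else 0.0
    R1 = c[n - 1] - s * q[-1] - p * q_prev
    R0 = c[n] - p * q[-1]
    return q, R1, R0


def bairstow(c, s, p, tol=1e-12, max_steps=50):
    """Newton iteration (26) for a quadratic factor x^2 + s x + p."""
    for _ in range(max_steps):
        q, R1, R0 = divide_by_quadratic(c, s, p)      # division (20)
        _, R3, R2 = divide_by_quadratic(q, s, p)      # division (23)
        # Partial derivatives from (25).
        dR1_ds, dR1_dp = s * R3 - R2, -R3
        dR0_ds, dR0_dp = p * R3, -R2
        det = dR1_ds * dR0_dp - dR1_dp * dR0_ds
        if det == 0.0:
            raise ZeroDivisionError("singular Newton system")
        # Cramer's rule for the 2 x 2 system (26).
        ds = (-R1 * dR0_dp + dR1_dp * R0) / det
        dp = (-dR1_ds * R0 + dR0_ds * R1) / det
        s, p = s + ds, p + dp
        if abs(ds) + abs(dp) < tol:
            break
    return s, p


# (x^2 - 2x + 5)(x^2 + 3x + 2) = x^4 + x^3 + x^2 + 11x + 10.
coeffs = [1.0, 1.0, 1.0, 11.0, 10.0]
s, p = bairstow(coeffs, -1.5, 4.0)
_, R1, R0 = divide_by_quadratic(coeffs, s, p)
```

From the assumed starting guess (-1.5, 4.0) the iterates approach the factor $x^2 - 2x + 5$; a production code would also check $p_\nu \neq s_\nu^2/4$ at each step, as the text advises.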


PROBLEMS, SECTION 4 

1. The square roots of a positive number $\beta$ are the zeros of the polynomial 
$f(x) = x^2 - \beta$. 

(a) Apply Newton’s method to obtain the sequence of approximations 

$$x_{\nu+1} = \frac{1}{2} \left( x_\nu + \frac{\beta}{x_\nu} \right), \qquad \nu = 0, 1, \ldots$$

This procedure is known as the Newton-Raphson method for computing 
square roots; it is frequently employed on high-speed digital computers. 

(b) If $x_0 > \sqrt{\beta}$ show that: $x_{\nu+1} < x_\nu$, $\nu = 0, 1, \ldots$ (assuming exact 
calculations). 

(c) Derive and study the analogous procedure for the nth root of any positive 
number where n is an integer. 

2. Derive a recursion formula for finding the coefficients of the Sturm 
sequence (10). Assume $f_0(x) \equiv \sum_{i=0}^{n} a_{i,0}\, x^{n-i}$. 

3. Show that every solution of (13a, b) is unique and hence can be repre¬ 
sented in the form (14). 

4. If the coefficients $a_j$ in the polynomial $P_n(x) = x^n + a_1 x^{n-1} + \cdots + a_n$ 
are altered by an amount at most $\epsilon$, then the polynomial 
$P_{n,\epsilon}(x) = P_n(x) + \epsilon E_{n-1}(x)$ is obtained where 
$E_{n-1}(x) = b_1 x^{n-1} + b_2 x^{n-2} + \cdots + b_n$ and $|b_j| \le 1$. 

Show that to each simple real root $r_i$ of $P_n(x)$, corresponds a simple real 
root $r_{i,\epsilon}$ of $P_{n,\epsilon}(x)$ such that $r_{i,\epsilon} - r_i = \mathcal{O}(\epsilon)$ as $\epsilon \to 0$. 

[Hint: Plot $P_{n,\epsilon}(x)$ in a neighborhood of $r_i$ for sufficiently small $\epsilon$.] 

5. Show how to modify Bernoulli’s method for the case of a dominant 
real multiple root. Estimate the convergence rate. 

6. If $P_n(x) = x^n + a_1 x^{n-1} + \cdots + a_{n-1} x + a_n$ with $a_n \neq 0$, then let 
$Q_n(z) = a_n z^n + a_{n-1} z^{n-1} + \cdots + a_1 z + 1$. Show that the roots $\{z_k\}$ of 
$Q_n(z) = 0$ and roots $\{x_k\}$ of $P_n(x) = 0$ are related by $z_k = 1/x_k$; k = 1, 
2, ..., n. Hence, show how the Bernoulli method may be used to find the 
zeros of $P_n(x)$ in ascending order of magnitude. 




4 


Computation of Eigenvalues 
and Eigenvectors 


0. INTRODUCTION 


The eigenvalue-eigenvector problem for a square matrix $A \equiv (a_{ij})$ of order 
n is that of determining a scalar $\lambda$ and vector x such that 

(1)    $$A\mathbf{x} = \lambda \mathbf{x}, \qquad \mathbf{x} \neq \mathbf{o}.$$

The problem is clearly non-linear since both $\lambda$ and x are unknowns. 
In fact, as is well known, the eigenvalues, $\lambda$, are the n roots of the 
characteristic equation 

(2)    $$\det(\lambda I - A) \equiv p_A(\lambda) = 0.$$

Thus the eigenvalues can, in principle, be found as the zeros of $p_A(\lambda)$ 
without recourse to any of the eigenvectors. Given some eigenvalue, $\lambda$, 
then the corresponding eigenvector is a non-trivial solution of the 
homogeneous linear system (1). On the other hand, if some particular 
eigenvector x is known, then the eigenvalue to which x belongs can be obtained 
by taking the inner product of (1) with x to get 

(3)    $$\lambda = \frac{\mathbf{x}^* A \mathbf{x}}{\mathbf{x}^* \mathbf{x}}.$$

Alternatively, we could use any non-zero component $x_i$ to get, from the 
ith row of (1), 

$$\lambda = \frac{1}{x_i} \sum_{j=1}^{n} a_{ij} x_j.$$

If some of the n eigenvalues of A are not distinct [i.e., if $p_A(\lambda)$ has 
multiple roots] then there may be fewer than n eigenvectors. For example, 

$$A \equiv \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$$

has the repeated eigenvalue $\lambda = 1$, and only one eigenvector, 

$$\mathbf{x} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}.$$

We consider, in Section 1, the well-posedness of the eigenvalue problem 
and a posteriori error bounds. 

In Section 2, we study the power method and its ramifications. These 
yield a sequence of scalars and vectors that converge (when the procedure 
works) to some particular eigenvalue and its corresponding eigenvector. 
In order to compute other eigenpairs, the iteration scheme must be 
modified. These simple methods yield successively only a few of the 
eigenvalues and vectors with acceptable accuracy, but for many applications 
this will be all that is needed. 

The methods studied in Section 3 use a finite or infinite sequence of 
matrix transformations to find a similar matrix $B \equiv P^{-1}AP$, such that the 
evaluation of $\det(\lambda I - B)$ is simpler than the calculation of $\det(\lambda I - A)$. 
We then find the eigenvalues as the zeros of $p_B(\lambda)$, by means of an iterative 
scheme, which does not explicitly use the coefficients of $p_B$. In the methods 
based on the use of an infinite sequence of matrix transformations, the 
eigenvalues themselves are usually explicitly exhibited in B. A special 
calculation is then required to obtain the eigenvectors. In summary, 
these methods may yield all of the eigenvalues without determining any 
eigenvectors. 

[We emphasize the practical importance of not finding explicitly the 
coefficients of $p_B(\lambda)$ or $p_A(\lambda)$ in order to evaluate the polynomials. All 
experienced practitioners are aware of the large errors that may result 
from the use of the approximate coefficients of $p_B(\lambda)$ or $p_A(\lambda)$ for the 
calculation of the zeros of the characteristic polynomials. We do not 
study the methods based on finding the coefficients of $p_A(\lambda)$.] 


1. WELL-POSEDNESS, ERROR ESTIMATES 

A simple criterion for localizing eigenvalues is given in 

Theorem 1 (Gerschgorin). Let $A \equiv (a_{ij})$ have eigenvalues $\lambda$ and define 
the absolute row and column sums by 

$$r_i \equiv \sum_{\substack{j=1 \\ j \neq i}}^{n} |a_{ij}|, \qquad c_j \equiv \sum_{\substack{i=1 \\ i \neq j}}^{n} |a_{ij}|.$$

Then, 

(a) each eigenvalue lies in the union of the row circles $R_i$, i = 1, 2, ..., n, 
where 

$$R_i \equiv \{ z \mid |z - a_{ii}| \le r_i \};$$

(b) each eigenvalue lies in the union of the column circles $C_j$, j = 1, 
2, ..., n, where 

$$C_j \equiv \{ z \mid |z - a_{jj}| \le c_j \};$$

(c) each component (maximal connected union of circles) of $\bigcup_i R_i$ or 
$\bigcup_j C_j$ contains exactly as many eigenvalues as circles. (The eigenvalues and 
the circles are counted with their multiplicities.) 

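Proof. (a) If $\lambda$ is an eigenvalue of A, then there exists an eigenvector 
$\mathbf{x} \neq \mathbf{o}$ such that 

$$A\mathbf{x} = \lambda \mathbf{x},$$

or 

$$(\lambda - a_{ii}) x_i = \sum_{j \neq i} a_{ij} x_j, \qquad i = 1, 2, \ldots, n.$$

Pick an i such that $|x_i| = \|\mathbf{x}\|_\infty \neq 0$. Then 

$$|\lambda - a_{ii}| \le \sum_{j \neq i} |a_{ij}| \left| \frac{x_j}{x_i} \right| \le r_i.$$

(b) Since the eigenvalues of A and $A^T$ are identical, (b) follows 
from (a). 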

(c) Here we apply a simpler form of the basic lemma of the 
theory of functions of a complex variable quoted in Chapter 1, Section 3, 
namely: 

Lemma 1. The n zeros of a polynomial 

$$p(x) = x^n + a_1 x^{n-1} + \cdots + a_{n-1} x + a_n$$

are continuous functions of the coefficients $\{a_j\}$. 

Consider the one-parameter family of matrices $A(t) \equiv D + tB$, where 
$D \equiv (a_{ii} \delta_{ij})$ is a diagonal matrix and $B \equiv A - D$. Consider, 
corresponding to A(t), the circles $R_i(t)$ and $C_j(t)$ with respective radii $t r_i$ and $t c_j$ 
about respective centers $a_{ii}$ and $a_{jj}$. Now, A(0) = D and clearly (c) holds 
for this diagonal matrix. The eigenvalues $\{\lambda(t)\}$ of A(t) are the zeros of 
$p_{A(t)}(\lambda) \equiv \det[\lambda I - A(t)]$. As t increases continuously to t = 1, the 
integer number of circles and, by Lemma 1, of eigenvalues in each 
component of $\bigcup_i R_i(t)$ [or of $\bigcup_j C_j(t)$] varies continuously, and hence is 
constant, except for a finite number of values, $t = t_l$, at which the number 
of circles in a component increases. At such values $t = t_l$, (c) holds 
because of Lemma 1. In other words, when the number of circles in a 
component increases, the number of eigenvalues in the component 
increases by the same amount. ■ 
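
Parts (a) and (b) of the theorem are easy to check numerically. The sketch below computes the row and column circles as (center, radius) pairs; the function names and the 2 x 2 sample matrix, whose eigenvalues 2 and 5 follow from its trace 7 and determinant 10, are assumptions for illustration.

```python
def gerschgorin_circles(A):
    """Return (row_circles, column_circles) as lists of (center, radius)."""
    n = len(A)
    rows = [(A[i][i], sum(abs(A[i][j]) for j in range(n) if j != i))
            for i in range(n)]
    cols = [(A[j][j], sum(abs(A[i][j]) for i in range(n) if i != j))
            for j in range(n)]
    return rows, cols


def in_union(z, circles):
    """True if z lies in at least one closed circle |z - center| <= radius."""
    return any(abs(z - center) <= radius for center, radius in circles)


A = [[4.0, 1.0],
     [2.0, 3.0]]                 # eigenvalues 2 and 5
rows, cols = gerschgorin_circles(A)
eigenvalues = [2.0, 5.0]
all_in_rows = all(in_union(lam, rows) for lam in eigenvalues)
all_in_cols = all(in_union(lam, cols) for lam in eigenvalues)
```

The same helper works for complex entries, since `abs` returns the modulus of a complex number.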

Gerschgorin’s theorem has many applications; the first we make is in 
the treatment of the well-posedness of the eigenvalue problem. In this 
connection, it is necessary to introduce the notions of left and right 
eigenvectors. 

The eigenvector x that satisfies (0.1), 

$$A\mathbf{x} = \lambda \mathbf{x}, \qquad \mathbf{x} \neq \mathbf{o},$$

is more properly called a right eigenvector of A. Correspondingly a left 
eigenvector, y, of A is a vector that satisfies 

(1)    $$\mathbf{y}^* A = \mu \mathbf{y}^*, \qquad \mathbf{y} \neq \mathbf{o}.$$

In other words, y is a left eigenvector of A iff y is a right eigenvector of A*, 
i.e., by starring both sides (by taking the complex conjugate transpose of 
both sides), 

$$A^* \mathbf{y} = \bar{\mu} \mathbf{y}.$$

(y* is also called a row eigenvector of A.) The sets of “left” $\{\mu\}$ and “right” 
$\{\lambda\}$ eigenvalues are identical, since 

$$\det(\lambda I - A) = \overline{\det(\bar{\lambda} I - \bar{A})} = \overline{\det(\bar{\lambda} I - A^*)},$$

where the last equality follows from $\det B = \det B^T$. 

Now, when the matrix A is Hermitian, the left and right eigenvectors 
are also identical; otherwise, when A is not Hermitian, the left and right 
eigenvectors are distinct, in general. 

Lemma 2. Any left eigenvector, y, and any right eigenvector, x, of the 
matrix A, corresponding to any pair of distinct eigenvalues, are orthogonal, 
i.e., $\mathbf{y}^* \mathbf{x} = 0$. 

Proof Given as Problem 1. ■ 

We say that the left and right eigenvectors are biorthogonal. 

We may now ask, “How well-posed is the eigenvalue problem (0.1) 
or (1)?” The answer is given in a special case by 

Theorem 2. Let A be of order n and have n linearly independent 
eigenvectors. For any fixed matrix C, with $\|C\| = \|A\|$, define the perturbed matrix 

(2)    $$A(\epsilon) \equiv A + \epsilon C.$$

Then, if $\lambda$ is any eigenvalue of A, there is an eigenvalue $\lambda(\epsilon)$ of $A(\epsilon)$ such that 

(3)    $$|\lambda(\epsilon) - \lambda| = \mathcal{O}(\epsilon).$$

Moreover, if $\lambda$ is simple 

(4)    $$\lim_{\epsilon \to 0} \frac{\lambda(\epsilon) - \lambda}{\epsilon} = \frac{\mathbf{y}^* C \mathbf{x}}{\mathbf{y}^* \mathbf{x}},$$

where x and y are, respectively, right and left eigenvectors of A corresponding 
to $\lambda$. 

Proof. Let P be the matrix whose columns are the right eigenvectors 
of A, i.e., 

(5)    $$AP = P\Lambda$$

where $\Lambda \equiv (\lambda_i \delta_{ij})$ is a diagonal matrix, such that $\{\lambda_i\}$ are the eigenvalues of 
A in some order. Therefore, from (5) and (2), 

(6)    $$P^{-1} A(\epsilon) P = \Lambda + \epsilon P^{-1} C P \equiv \Lambda + \epsilon E, \qquad E \equiv (e_{ij}) \equiv P^{-1} C P.$$

The estimate (3) now follows from Gerschgorin’s theorem part (a), since 
the eigenvalues of $A(\epsilon)$ are unchanged by the similarity transformation. 

We can now improve upon (3), if the eigenvalue $\lambda = \lambda_i$ is simple. 
That is, we observe that the circle $R_i(\epsilon)$ of the matrix $P^{-1} A(\epsilon) P$ will not 
intersect any other circles $R_j(\epsilon)$, if $|\epsilon|$ is small enough, since $\lambda_i$ is distinct 
from all other $\lambda_j$. Therefore, there is a unique simple eigenvalue $\lambda(\epsilon)$ in 
$R_i(\epsilon)$ and therefore also a unique corresponding eigenvector $\mathbf{x}(\epsilon)$ of $A(\epsilon)$. 
Now, the eigenvalues $\lambda$, $\lambda(\epsilon)$ and eigenvectors x, $\mathbf{x}(\epsilon)$ satisfy 

(7)    $$(A + \epsilon C)\mathbf{x}(\epsilon) = \lambda(\epsilon)\mathbf{x}(\epsilon), \qquad A\mathbf{x} = \lambda \mathbf{x}.$$

Therefore by subtraction, 

(8)    $$A[\mathbf{x}(\epsilon) - \mathbf{x}] + \epsilon C \mathbf{x}(\epsilon) = [\lambda(\epsilon) - \lambda]\mathbf{x}(\epsilon) + \lambda[\mathbf{x}(\epsilon) - \mathbf{x}].$$

If we left-multiply both sides of (8) by the row vector y*, that satisfies 
$\mathbf{y}^* A = \lambda \mathbf{y}^*$, then 

(9)    $$\epsilon\, \mathbf{y}^* C \mathbf{x}(\epsilon) = [\lambda(\epsilon) - \lambda]\, \mathbf{y}^* \mathbf{x}(\epsilon).$$

The conclusion (4) follows if we show that 

$$\mathbf{y}^* \mathbf{x} \neq 0$$

and that we may select $\mathbf{x}(\epsilon)$ and x so that, as $\epsilon \to 0$, 

$$\mathbf{x}(\epsilon) \to \mathbf{x}.$$

The fact that $\mathbf{y} \neq \mathbf{o}$ is orthogonal to each of the n - 1 other right 
eigenvectors of A, establishes the non-orthogonality of y and x. From the 
relation (3), equation (7), and the assumed simplicity of $\lambda$, we know that 
the matrices $A - \lambda I$ and $A + \epsilon C - \lambda(\epsilon) I$ have rank n - 1 and that we 
can delete the same row and the same column from each to form 
non-singular submatrices, if $|\epsilon|$ is sufficiently small. Hence, if the deleted 
column is the jth, we may set $x_j(\epsilon) = 1$ and then, from Problem 3, $\mathbf{x}(\epsilon)$ 
will converge as $|\epsilon| \to 0$, to an eigenvector x satisfying (7). ■ 

Corollary (Bauer-Fike). Under the same hypothesis and definitions as are 
used to establish (3), each eigenvalue $\lambda(\epsilon)$ of $A + \epsilon C$ satisfies 

(3')    $$\min_{\lambda} |\lambda(\epsilon) - \lambda| \le |\epsilon| \, \|C\|_p \, \|P^{-1}\|_p \, \|P\|_p,$$

for any p-norm, $\|\mathbf{x}\|_p$, with $1 \le p \le \infty$. 

Proof. Given as Problem 5. ■ 

Observe that the algebraic Lemma of Section 3 in Chapter 1 suggests 
only that (3) holds if $\lambda$ is a simple zero of $p_A(\lambda)$; but if $\lambda$ has multiplicity 
m, then $|\lambda(\epsilon) - \lambda| = \mathcal{O}(\epsilon^{1/m})$. On the other hand, although (9) holds for 
the general case, (4) may not hold because y*x may vanish as shown in 
Problem 2. 
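
The first-order formula (4) can be checked numerically on a small example. In the sketch below, the 2 x 2 matrices A and C, the eigenvectors, the step $\epsilon = 10^{-6}$, and the helper names `eig2` and `sensitivity` are all assumptions for illustration; C is not normalized so that $\|C\| = \|A\|$, which only rescales both sides of (4).

```python
import cmath


def eig2(M):
    """Eigenvalues of a 2 x 2 matrix from its characteristic quadratic."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    root = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr - root) / 2.0, (tr + root) / 2.0


def sensitivity(y, C, x):
    """The quotient y*Cx / y*x of (4); y* is the conjugate transpose of y."""
    n = len(x)
    Cx = [sum(C[i][j] * x[j] for j in range(n)) for i in range(n)]
    num = sum(complex(y[i]).conjugate() * Cx[i] for i in range(n))
    den = sum(complex(y[i]).conjugate() * x[i] for i in range(n))
    return num / den


A = [[1.0, 100.0], [0.0, 2.0]]     # eigenvalues 1 and 2
C = [[0.0, 0.0], [1.0, 0.0]]
x = [1.0, 0.0]                     # right eigenvector for lambda = 1
y = [1.0, -100.0]                  # left eigenvector: y*A = y*
eps = 1e-6
A_eps = [[A[i][j] + eps * C[i][j] for j in range(2)] for i in range(2)]
lam_eps = eig2(A_eps)[0]           # the perturbed eigenvalue near 1
finite_difference = (lam_eps - 1.0) / eps
predicted = sensitivity(y, C, x)
```

Here y*x = 1 while $\|\mathbf{y}\|_2$ is about 100, so after normalizing y the quantity |y*x| is small, and the eigenvalue 1 reacts strongly to the perturbation even though it is simple.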

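We see that when (4) holds, the well-posedness of the eigenvalue problem 
for determining $\lambda$ depends on the magnitude of 

$$\frac{\mathbf{y}^* C \mathbf{x}}{\mathbf{y}^* \mathbf{x}}.$$

If we normalize the vectors, so that $\|\mathbf{y}\|_2 = \|\mathbf{x}\|_2 = 1$, and use the Schwarz 
inequality (see Problem 6), there results 

(10)    $$\left| \frac{\mathbf{y}^* C \mathbf{x}}{\mathbf{y}^* \mathbf{x}} \right| \le \frac{\|C\|_2}{|\mathbf{y}^* \mathbf{x}|}.$$

In the special case that $C \equiv \|A\|_2 U$, where U is a unitary matrix (i.e., 
$U^* = U^{-1}$) such that $U\mathbf{x} = \mathbf{y}$ (see Problems 9 and 12), we find that 
$\|C\|_2 = \|A\|_2$ and furthermore 

(10')    $$\left| \frac{\mathbf{y}^* C \mathbf{x}}{\mathbf{y}^* \mathbf{x}} \right| = \left| \frac{\mathbf{y}^* \mathbf{y}}{\mathbf{y}^* \mathbf{x}} \right| \|A\|_2 = \frac{\|A\|_2}{|\mathbf{y}^* \mathbf{x}|} = \frac{\|C\|_2}{|\mathbf{y}^* \mathbf{x}|}.$$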

Now, it is possible that the problem of finding $\lambda$, an eigenvalue of A, is 
not well-posed, although the problem of finding the same $\lambda$ as an 
eigenvalue of $B \equiv P^{-1}AP$ is well-posed (this fact is illustrated in Problem 13). 
The reader may on first impulse think that this statement is contradicted 
by Theorem 3 given in Problem 11. But a closer examination of Theorem 3 
indicates that although the perturbed matrices $A + \epsilon C$ and $B + \epsilon D$ are 
in the one to one correspondence $D = P^{-1}CP$, the magnitudes of the 
corresponding matrices, $\|D\|$ and $\|C\|$, may differ considerably unless 
$\|P\| \cdot \|P^{-1}\|$ is not very large, since (see Problem 7) 

(11)    $$\frac{\|C\|}{\|P^{-1}\| \cdot \|P\|} \le \|P^{-1} C P\| \le \|P^{-1}\| \cdot \|C\| \cdot \|P\|.$$

On the other hand, in the test for well-posedness of $\lambda$, it is implied that we 
consider only perturbations such that $\|C\| = \|A\|$ or such that $\|D\| = \|B\|$. 
Hence no contradiction is involved. 

We may then justifiably say that when (4) holds for all C and |y*x| is 
not small (for say $\|\mathbf{y}\|_2 = \|\mathbf{x}\|_2 = 1$), the eigenvalue problem for $\lambda$ of A 
is well-posed, since from (4), (10), and (10'), with $\|C\|_2 = \|A\|_2$, 

$$\max_{\{C\}} \lim_{\epsilon \to 0} \left| \frac{\lambda(\epsilon) - \lambda}{\epsilon} \right| = \frac{\|A\|_2}{|\mathbf{y}^* \mathbf{x}|}.$$

We have the immediate consequence of (10). 

Theorem 4. If A is Hermitian and $\lambda$ any eigenvalue of A, then the 
eigenvalue problem for $\lambda$ is well-posed. 

Proof. If $\lambda$ is not simple, use (7) and Problem 4 to find $\mathbf{x}(\epsilon)$ and x 
with $\|\mathbf{x}(\epsilon) - \mathbf{x}\| \to 0$. Now set y = x in (9) and note that $\mathbf{y}^* \mathbf{x}(\epsilon) \to 1$. ■ 

Fortunately, the most common matrix transformation methods use 
orthogonal or unitary similarity transformations of A to produce a simpler 
matrix B which, by accounting for roundoff errors, can be written as 

$$B(\epsilon) = P^{-1}(A + \epsilon C)P.$$

The matrix P is unitary and, depending on the kind of arithmetic used, 

(12)    $$\|C\|_2 \le 1; \qquad |\epsilon| = \mathcal{O}(a\, n^p\, 10^{-t}),$$

where $a \equiv \max_{i,j} |a_{ij}|$, $p \le 2$, and t represents the number of figures used 
in single precision arithmetic. A priori error estimates of the form (12) 
have been obtained by Givens and Wilkinson. We shall not pursue this 
topic but rather give an account of some a posteriori error estimates. 


1.1. A Posteriori Error Estimates 

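In the case of a Hermitian matrix, A, we can give a simple error estimate 
for a computed eigenvalue, $\lambda$, in terms of the residual vector, $\boldsymbol{\eta}$. 

Theorem 5. Let A be a Hermitian matrix of order n and have eigenvalues 
$\{\lambda_i\}$. If 

(13)    $$A\mathbf{x} - \lambda \mathbf{x} \equiv \boldsymbol{\eta}, \qquad \mathbf{x} \neq \mathbf{o},$$

then 

$$\min_i |\lambda - \lambda_i| \le \frac{\|\boldsymbol{\eta}\|_2}{\|\mathbf{x}\|_2}.$$

Proof. Since A is Hermitian there exists, in $C_n$, an orthonormal basis 
of eigenvectors $\{\mathbf{u}_i\}$ such that 

$$A\mathbf{u}_i = \lambda_i \mathbf{u}_i, \qquad i = 1, 2, \ldots, n.$$

Therefore, we can express x as a linear combination 

(14)    $$\mathbf{x} = \sum_{i=1}^{n} a_i \mathbf{u}_i$$

with $a_i = \mathbf{u}_i^* \mathbf{x}$. From (13) and (14), 

$$\boldsymbol{\eta} = \sum_{i=1}^{n} a_i (\lambda_i - \lambda) \mathbf{u}_i.$$

Hence 

$$\|\boldsymbol{\eta}\|_2^2 = \sum_{i=1}^{n} |a_i|^2 (\lambda_i - \lambda)^2.$$

Therefore 

(15)    $$\frac{\boldsymbol{\eta}^* \boldsymbol{\eta}}{\mathbf{x}^* \mathbf{x}} = \sum_{i=1}^{n} b_i (\lambda_i - \lambda)^2$$

with 

$$b_i \equiv \frac{|a_i|^2}{\sum_{j=1}^{n} |a_j|^2}.$$

Note that 

$$b_i \ge 0, \qquad \sum_{i=1}^{n} b_i = 1,$$

whence 

$$\min_i (\lambda_i - \lambda)^2 \le \sum_{i=1}^{n} b_i (\lambda_i - \lambda)^2.$$

The conclusion 

$$\min_i |\lambda - \lambda_i| \le \frac{\|\boldsymbol{\eta}\|_2}{\|\mathbf{x}\|_2}$$

is thus established. ■ 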

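An obvious way to improve upon $\lambda$ when given an approximate 
eigenvector, x, is suggested by Theorem 5, namely 

Theorem 6. Given A, a Hermitian matrix, and x, the residual vector $\boldsymbol{\eta}$ 
defined by (13) is minimal for 

(16)    $$\lambda = \lambda_R \equiv \frac{\mathbf{x}^* A \mathbf{x}}{\mathbf{x}^* \mathbf{x}}.$$

Proof. The quadratic expression in $\lambda$ 

$$\|\boldsymbol{\eta}\|_2^2 = (\mathbf{x}^* A - \lambda \mathbf{x}^*)(A\mathbf{x} - \lambda \mathbf{x}) = \mathbf{x}^* A^2 \mathbf{x} - 2\lambda\, \mathbf{x}^* A \mathbf{x} + \lambda^2\, \mathbf{x}^* \mathbf{x}$$

assumes its minimum at $\lambda = \lambda_R$. That is, 

$$\min_{\lambda} \boldsymbol{\eta}^* \boldsymbol{\eta} = \mathbf{x}^* A^2 \mathbf{x} - \lambda_R^2\, \mathbf{x}^* \mathbf{x}. \quad ■$$

The quantity $\lambda_R$ defined by (16) is called the Rayleigh quotient and will 
be referred to later on in the discussion of iterative methods. 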
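
Theorems 5 and 6 can be illustrated with a short Python sketch for a real symmetric matrix (a special case of a Hermitian matrix). The helper names, the 2 x 2 sample matrix with eigenvalues 1 and 3, and the trial vector are assumptions made for the example.

```python
import math


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def rayleigh_quotient(A, x):
    """lambda_R = x*Ax / x*x of (16), for real symmetric A and real x."""
    return dot(x, matvec(A, x)) / dot(x, x)


def residual_norm(A, x, lam):
    """The 2-norm of the residual eta = Ax - lam*x of (13)."""
    eta = [axi - lam * xi for axi, xi in zip(matvec(A, x), x)]
    return math.sqrt(dot(eta, eta))


A = [[2.0, 1.0], [1.0, 2.0]]       # real symmetric; eigenvalues 1 and 3
x = [1.0, 0.9]                     # rough approximation to the eigenvector (1, 1)
lam_R = rayleigh_quotient(A, x)
bound = residual_norm(A, x, lam_R) / math.sqrt(dot(x, x))   # Theorem 5
error = min(abs(lam_R - 1.0), abs(lam_R - 3.0))
```

In this example the bound of Theorem 5 is roughly 0.1 while the actual error in $\lambda_R$ is below 0.01, which is typical: the bound is safe but can be pessimistic.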

For the Hermitian matrix A, with eigenvalues $\{\lambda_i\}$ and corresponding 
eigenvectors $\{\mathbf{u}_i\}$, let $U_r$ denote the linear space spanned by the eigenvectors 
$\mathbf{u}_i$, i = 1, 2, ..., r. If we know something about the spacing of the 
eigenvalues, we may be able to estimate the error, $\|\mathbf{x} - U_r\|_2$, of an approximate 
eigenvector x. (We use here the notation for distance between a vector x 
and a set S: 

$$\|\mathbf{x} - S\| \equiv \operatorname*{g.l.b.}_{\mathbf{y} \in S} \|\mathbf{x} - \mathbf{y}\|.)$$

That is, 

Theorem 7. If $|\lambda_i - \lambda| \le \|\boldsymbol{\eta}\|_2$, i = 1, 2, ..., r, with $\lambda$, $\boldsymbol{\eta}$, and x that 
satisfy (13), (14); $|\lambda_i - \lambda| \ge d > 0$ for i = r + 1, r + 2, ..., n, then 

(17)    $$\|\mathbf{x} - U_r\|_2 \le \left\| \mathbf{x} - \sum_{i=1}^{r} a_i \mathbf{u}_i \right\|_2 \le \frac{\|\boldsymbol{\eta}\|_2}{d}.$$

Proof. From the above definition and (14), 

(18)    $$\|\mathbf{x} - U_r\|_2^2 \le \left\| \mathbf{x} - \sum_{i=1}^{r} a_i \mathbf{u}_i \right\|_2^2 = \sum_{i=r+1}^{n} |a_i|^2.$$

Now by (13) and (14) 

$$\|\boldsymbol{\eta}\|_2^2 = \sum_{i=1}^{n} |a_i|^2 (\lambda_i - \lambda)^2 \ge \sum_{i=r+1}^{n} |a_i|^2 (\lambda_i - \lambda)^2.$$

Hence by the hypothesis on the spacing of eigenvalues, 

$$d^2 \sum_{i=r+1}^{n} |a_i|^2 \le \|\boldsymbol{\eta}\|_2^2,$$

or 

$$\sum_{i=r+1}^{n} |a_i|^2 \le \frac{\|\boldsymbol{\eta}\|_2^2}{d^2}.$$

Now, from (18), the estimate (17) follows. ■ 


Theorems 5 and 7 give us a posteriori estimates for the error in an 
approximate eigenvalue and eigenvector of a Hermitian matrix. These 
estimates do not require detailed information about all of the eigenvalues 
and eigenvectors of the matrix. Unfortunately, for a non-Hermitian matrix, 
the situation is more complicated and we state 

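Theorem 8 (Franklin). Let A be of order n, and have a set of n linearly 
independent eigenvectors $\{\mathbf{u}_i\}$, eigenvalues $\{\lambda_i\}$, which satisfy $AU = U\Lambda$, 
with $U \equiv (\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n)$, $\Lambda \equiv (\lambda_i \delta_{ij})$. 

If for some $\epsilon > 0$, 

(19)    $$\|A\mathbf{x} - \lambda \mathbf{x}\|_2 \le \epsilon \|A\mathbf{x}\|_2,$$

then 

(20)    $$\min_{\lambda_i \neq 0} \left| 1 - \frac{\lambda}{\lambda_i} \right| \le \epsilon \|U\|_2 \|U^{-1}\|_2.$$

Proof. Define $\mathbf{b} \equiv U^{-1}\mathbf{x}$, so that 

(21)    $$\mathbf{x} = U\mathbf{b}, \qquad A\mathbf{x} - \lambda \mathbf{x} = U(\Lambda - \lambda I)\mathbf{b}.$$

Now $\mathbf{y} = U^{-1}(U\mathbf{y})$ implies $\|\mathbf{y}\| \le \|U^{-1}\| \, \|U\mathbf{y}\|$. Therefore 

(22)    $$\|U\mathbf{y}\| \ge \|U^{-1}\|^{-1} \|\mathbf{y}\|.$$

With $\mathbf{y} \equiv (\Lambda - \lambda I)\mathbf{b}$, (22) and (21) imply 

(23)    $$\|A\mathbf{x} - \lambda \mathbf{x}\| \ge \|U^{-1}\|^{-1} \|(\Lambda - \lambda I)\mathbf{b}\|.$$

But from (19) and (23) 

$$\epsilon \|U \Lambda \mathbf{b}\|_2 = \epsilon \|A\mathbf{x}\|_2 \ge \|A\mathbf{x} - \lambda \mathbf{x}\|_2 \ge \|U^{-1}\|_2^{-1} \|(\Lambda - \lambda I)\mathbf{b}\|_2.$$

Hence 

$$\epsilon \|U\|_2 \|\Lambda \mathbf{b}\|_2 \ge \|U^{-1}\|_2^{-1} \|(\Lambda - \lambda I)\mathbf{b}\|_2.$$

Therefore, since $A\mathbf{x} \neq \mathbf{o}$ implies $\Lambda \mathbf{b} \neq \mathbf{o}$, 

(24)    $$\epsilon \|U\|_2 \|U^{-1}\|_2 \ge \frac{\|(\Lambda - \lambda I)\mathbf{b}\|_2}{\|\Lambda \mathbf{b}\|_2} = \left[ \frac{\sum_{i=1}^{n} |\lambda_i - \lambda|^2 |b_i|^2}{\sum_{i=1}^{n} |\lambda_i|^2 |b_i|^2} \right]^{1/2} \ge \min_{\lambda_i \neq 0} \left| 1 - \frac{\lambda}{\lambda_i} \right|. \quad ■$$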





Theorem 9. The hypothesis of Theorem 8, with (19) replaced by 

(19')    $$\|A\mathbf{x} - \lambda \mathbf{x}\|_2 \le \epsilon \|\mathbf{x}\|_2,$$

implies 

(20')    $$\min_i |\lambda - \lambda_i| \le \epsilon \|U^{-1}\|_2 \|U\|_2.$$

Proof. Left as Problem 8. ■ 

Unfortunately, Theorems 8 and 9 require information about the matrix 
of eigenvectors, which is not generally available in problems where we are 
only interested in obtaining a few of the eigenvalues and eigenvectors. In 
the special case of a Hermitian matrix, A, the matrix of eigenvectors, U, 
is unitary (see Problem 9), whence $\|U\|_2 = \|U^{-1}\|_2 = 1$. In this case, the 
estimate (20') of Theorem 9 reduces to the estimate given in Theorem 5. 
If, on the other hand, for the case that the eigenvectors $\{\mathbf{u}_i\}$ of A form a 
basis of $C_n$, we have a set of vectors $\{\mathbf{v}_i\}$ that approximate the eigenvectors, 
then let $P \equiv (\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n)$. 

Clearly, if we define R and $\epsilon$ by 

$$P^{-1} A P = \Lambda + \epsilon R$$

where $\Lambda \equiv (\lambda_i \delta_{ij})$, $\|R\| = 1$, then $\epsilon$ is small when P is a good approximation 
of U. The Gerschgorin circles may now be small enough to give close 
estimates for the eigenvalues. 


PROBLEMS, SECTION 1 

1. If y is a left eigenvector and x is a right eigenvector of A, corresponding to 
distinct eigenvalues, show that y*x = 0. 

2. (a) Show that the left and right eigenvectors corresponding to $\lambda = 1$ 
are orthogonal for 

' ■ c :>• 

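(b) Find the eigenvalues $\lambda(\epsilon)$ for $A(\epsilon) \equiv A + \epsilon C$, 



and verify that 

$$|\lambda(\epsilon) - \lambda| = \mathcal{O}(\sqrt{\epsilon}).$$

(c) Find the eigenvectors $\mathbf{x}(\epsilon)$ of $A(\epsilon)$ and verify by substitution that 
(9) holds and that both the eigenvectors $\mathbf{x}(\epsilon)$ converge to the eigenvector of A. 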

3. Let B and $B(\epsilon)$ be of order n, have rank n - 1, and 

$$B(\epsilon) \to B \quad \text{as } \epsilon \to 0.$$

Show that for $|\epsilon|$ sufficiently small there exists a vector $\mathbf{x}(\epsilon)$ in the null space 
of $B(\epsilon)$, i.e., 

$$B(\epsilon)\mathbf{x}(\epsilon) = \mathbf{o},$$

such that 

$$\|\mathbf{x}(\epsilon)\|_\infty \ge 1$$

and 

$$\mathbf{x}(\epsilon) \to \mathbf{x}(0) \quad \text{as } \epsilon \to 0,$$

where 

$$B\mathbf{x}(0) = \mathbf{o}.$$

[Hint: Since $B(\epsilon) \to B$, we see that for all sufficiently small $|\epsilon|$, the (n - 1)st 
order square submatrices $B_{ij}$ of B and $B_{ij}(\epsilon)$ of $B(\epsilon)$ found by deleting the ith 
row and jth column from each of B and $B(\epsilon)$ for some pair of indices (i, j), 
are non-singular and $B_{ij}(\epsilon) \to B_{ij}$. Hence, the Gaussian elimination method 
may be used to triangularize $B_{ij}$. The same pivotal elements may be used to 
triangularize $B_{ij}(\epsilon)$ for $|\epsilon|$ sufficiently small. Set $x_j(\epsilon) = 1$ and solve $B(\epsilon)\mathbf{x}(\epsilon) = \mathbf{o}$ 
by using the above triangularization.] 

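4. Let B and $B(\epsilon)$ be of order n, B have rank n - r, $B(\epsilon)$ have rank < n, 
and 

$$B(\epsilon) \to B \quad \text{as } \epsilon \to 0.$$

Let the null space of B be $S \equiv \{\mathbf{x} \mid B\mathbf{x} = \mathbf{o}\}$. 

Show that for $|\epsilon|$ sufficiently small, there exists a vector $\mathbf{x}(\epsilon)$ in the null space 
of $B(\epsilon)$, such that 

$$\|\mathbf{x}(\epsilon)\|_\infty \ge 1,$$

$$\min_{\mathbf{x} \in S} \|\mathbf{x}(\epsilon) - \mathbf{x}\| \to 0 \quad \text{as } \epsilon \to 0.$$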


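5. Carry out the proof of (3') (see corollary to Theorem 2). 
[Hint: Let $\mathbf{x}(\epsilon) \neq \mathbf{o}$ be any eigenvector satisfying 

$$(A + \epsilon C)\mathbf{x}(\epsilon) = \lambda(\epsilon)\mathbf{x}(\epsilon).$$

With the matrices P and $\Lambda$ used in the proof of Theorem 2, 

$$[\lambda(\epsilon) I - \Lambda] P^{-1} \mathbf{x}(\epsilon) = \epsilon P^{-1} C P [P^{-1} \mathbf{x}(\epsilon)].$$

If $\lambda(\epsilon) I - \Lambda$ is singular, (3') holds. If $\lambda(\epsilon) I - \Lambda$ is not singular, 

$$P^{-1} \mathbf{x}(\epsilon) = \epsilon [\lambda(\epsilon) I - \Lambda]^{-1} P^{-1} C P [P^{-1} \mathbf{x}(\epsilon)].$$

Take any norm $\| \cdot \|_p$ of both sides to get (3').] 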

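6. Apply Schwarz’ inequality (i.e., $|\mathbf{y}^* \mathbf{x}| \le \|\mathbf{y}\|_2 \|\mathbf{x}\|_2$) to show that 

$$|\mathbf{y}^* C \mathbf{x}| \le \|C\|_2 \|\mathbf{y}\|_2 \|\mathbf{x}\|_2.$$

7. Show that 

$$\frac{\|C\|}{\|P^{-1}\| \, \|P\|} \le \|P^{-1} C P\| \le \|C\| \, \|P^{-1}\| \, \|P\|.$$

[Hint: $C = P(P^{-1} C P)P^{-1}$.] 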

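8. Prove Theorem 9. 

[Hint: Estimate 

$$\frac{\|(\Lambda - \lambda I)\mathbf{b}\|_2^2}{\|\mathbf{b}\|_2^2} = \frac{\sum_{i=1}^{n} |\lambda_i - \lambda|^2 |b_i|^2}{\sum_{i=1}^{n} |b_i|^2} \ge \min_i |\lambda_i - \lambda|^2.]$$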


9. If U is unitary, show that $\|U\|_2 = \|U^{-1}\|_2 = 1$. 

10. If $P_m(\lambda) \equiv c_0 \lambda^m + c_1 \lambda^{m-1} + \cdots + c_m$, and A has the eigenvalues $\{\lambda_i\}$ 
and the eigenvectors $\{\mathbf{u}_i\}$, then $P_m(A)$ has the eigenvalues $\{P_m(\lambda_i)\}$ with the same 
eigenvectors. 

If A is Hermitian, show that the Rayleigh quotient, defined by (16) for any 
$\mathbf{x} \neq \mathbf{o}$, satisfies 

$$\lambda_1 \le \lambda_R \le \lambda_n,$$

where $\lambda_1$ and $\lambda_n$ are the smallest and largest eigenvalues of A. 

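11. Prove 

Theorem 3. Let x be a right eigenvector and y a left eigenvector of A for the 
eigenvalue $\lambda$. For the similarity transformation given by any non-singular P, set 

$$B \equiv P^{-1} A P, \qquad B(\epsilon) \equiv P^{-1} A(\epsilon) P.$$

Then y*Cx/y*x is invariant under P. 

[Hint: $\mathbf{u} \equiv P^{-1}\mathbf{x}$ is the right eigenvector of B corresponding to x; $\mathbf{v} \equiv P^* \mathbf{y}$ 
is the left eigenvector of B corresponding to y; $\epsilon P^{-1} C P$ is the perturbation 
corresponding to $\epsilon C$. Hence, with $D \equiv P^{-1} C P$, 

$$\frac{\mathbf{v}^* D \mathbf{u}}{\mathbf{v}^* \mathbf{u}} = \frac{\mathbf{v}^* P^{-1} C P \mathbf{u}}{\mathbf{v}^* \mathbf{u}} = \frac{\mathbf{y}^* C \mathbf{x}}{\mathbf{y}^* \mathbf{x}}.]$$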

12. Given x and y with $\|\mathbf{x}\|_2 = \|\mathbf{y}\|_2 = 1$, construct a unitary matrix U 
such that $U\mathbf{x} = \mathbf{y}$. 

13. Construct the matrix A which has the eigenvalues $\lambda_j$ and the 
corresponding eigenvectors $\mathbf{x}_j$, where $x_{1k} = 1$ for $1 \le k \le n$, $x_{kk} = \delta$ for 
$2 \le k \le n$, and $x_{ij} = 0$ otherwise. Show that the left unit eigenvectors $\mathbf{y}_j$ 
are determined by the biorthogonality property. Hence verify that 

l * tl " 4drh)* and iy * x|-1 = Knir 1 )- 

(Therefore, for small $\delta$, the eigenvalue problem for finding $\lambda_1$ is not 
well-posed. But the eigenvalue problem for $B \equiv P^{-1} A P$ is exceedingly well-posed 
if $P \equiv (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$.) 

[Hint: Start with the diagonal matrix $B \equiv (\lambda_i \delta_{ij})$ and P. Construct $P^{-1}$ and 
then $A = P B P^{-1}$. Sketch a picture of $\mathbf{x}_j$ and $\mathbf{y}_j$ in the three dimensional case.] 


2. THE POWER METHOD 

The power method, in its basic form, is conceptually the simplest 
iterative procedure for approximating the largest or principal eigenvalue 
and eigenvector of a matrix. Let us assume, throughout this subsection, 
that the nth order matrix A has real elements $(a_{ij})$, n linearly independent 
eigenvectors $\{\mathbf{u}_j\}$, j = 1, 2, ..., n, and a unique eigenvalue of maximum 
magnitude, i.e., the eigenvalues satisfy 

$$|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_n|.$$

Since the $\lambda_j$ are the roots of a characteristic polynomial with real 
coefficients, the complex eigenvalues occur in complex conjugate pairs. 
Hence $\lambda_1$ is real. Since $\mathbf{u}_1$ satisfies $A\mathbf{u}_1 = \lambda_1 \mathbf{u}_1$, the components $(u_{i1})$ of 
$\mathbf{u}_1$ may be taken to be real. 

Let $\mathbf{x}_0$ be an arbitrarily chosen real n-dimensional vector and form the 
sequence of vectors 

(1)    $$\mathbf{x}_\nu = A\mathbf{x}_{\nu-1} = A^\nu \mathbf{x}_0; \qquad \nu = 1, 2, \ldots$$

Since A has a complete set of eigenvectors $\{\mathbf{u}_j\}$, say with components 
$(u_{ij})$, there exist n scalars $a_j$ such that 

(2)    $$\mathbf{x}_0 = \sum_{j=1}^{n} a_j \mathbf{u}_j.$$

Then the sequence (1) has the representation 

(3)    $$\mathbf{x}_\nu = \sum_{j=1}^{n} \lambda_j^\nu a_j \mathbf{u}_j = \lambda_1^\nu \left[ a_1 \mathbf{u}_1 + \sum_{j=2}^{n} \left( \frac{\lambda_j}{\lambda_1} \right)^{\nu} a_j \mathbf{u}_j \right]; \qquad \nu = 1, 2, \ldots$$

Now clearly, since $|\lambda_j/\lambda_1| < 1$ for all $j \ge 2$, the directions of the vectors $\mathbf{x}_\nu$ 
converge to that of $\mathbf{u}_1$ as $\nu \to \infty$, provided only that $a_1 \neq 0$. Of course, 
$\lambda_1^\nu$, in general, either converges to zero or becomes unbounded, and so 
the sequence (1) may not be practical for computations. However, a 
simple scaling of the iterates $\mathbf{x}_\nu$, to be introduced later, will remedy this 
defect. 

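Since $x_\nu$, for large $\nu$, may be a close approximation to the eigenvector
belonging to $\lambda_1$, we can employ the methods indicated in the introduction
to approximate the eigenvalue. Thus let us form the ratio of, say, the $k$th
components of $x_\nu$ and $Ax_\nu = x_{\nu+1}$. Call the ratio $\sigma_{\nu+1}$ and get from (3),

(4) $\sigma_{\nu+1} = \frac{x_{k,\nu+1}}{x_{k,\nu}} = \lambda_1 \frac{a_1 u_{k1} + \sum_{j=2}^{n} (\lambda_j/\lambda_1)^{\nu+1} a_j u_{kj}}{a_1 u_{k1} + \sum_{j=2}^{n} (\lambda_j/\lambda_1)^{\nu} a_j u_{kj}}.$

If $a_1 \ne 0$ and $k$ is chosen such that $u_{k1} \ne 0$, then for $\nu$ so large that
$|\lambda_2/\lambda_1|^\nu \ll 1$, (4) yields

(5) $\sigma_{\nu+1} = \lambda_1 + O\left( \left| \frac{\lambda_2}{\lambda_1} \right|^\nu \right).$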

Thus, in approximating the eigenvalues, the growth or decay of $\lambda_1^\nu$
causes no theoretical difficulties. The convergence of $\sigma_\nu$ to $\lambda_1$ is seen to be
at least of first order with ratio at most $|\lambda_2/\lambda_1|$. As in our previous studies
of iteration schemes (e.g., Chapter 2, Section 4; Chapter 3, Section 1),
we may define the rate of convergence as

(6) $R = \ln \left| \frac{\lambda_1}{\lambda_2} \right|.$

Difficulties in convergence may be expected if the first two eigenvalues
(in magnitude) are “close.”

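Another way to approximate $\lambda_1$ is by means of the Rayleigh quotient
indicated in (0.3). Thus we define

(7) $\sigma'_{\nu+1} = \frac{x_\nu^* A x_\nu}{x_\nu^* x_\nu} = \frac{x_\nu^* x_{\nu+1}}{x_\nu^* x_\nu}.$

Then from (3) and (7) we have

$\sigma'_{\nu+1} = \lambda_1 \frac{\sum_{i,j=1}^{n} (\bar\lambda_i/\lambda_1)^{\nu} (\lambda_j/\lambda_1)^{\nu+1} \bar a_i a_j u_i^* u_j}{\sum_{i,j=1}^{n} (\bar\lambda_i/\lambda_1)^{\nu} (\lambda_j/\lambda_1)^{\nu} \bar a_i a_j u_i^* u_j}.$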

A calculation reveals that $\sigma'_\nu$ converges to $\lambda_1$ just as does $\sigma_\nu$ [i.e., as in
equation (5)]. However, if the vectors $u_i$ are mutually orthogonal or say
for convenience orthonormal, i.e., $u_i^*u_j = \delta_{ij}$, then by Problem 1, $A$ is
symmetric and has real eigenvalues and eigenvectors, so that

(8) $\sigma'_{\nu+1} = \lambda_1 \frac{a_1^2 + \sum_{j=2}^{n} (\lambda_j/\lambda_1)^{2\nu+1} a_j^2}{a_1^2 + \sum_{j=2}^{n} (\lambda_j/\lambda_1)^{2\nu} a_j^2}.$

Hence,

(9) $\sigma'_{\nu+1} = \lambda_1 + O\left( \left| \frac{\lambda_2}{\lambda_1} \right|^{2\nu} \right)$

when $A$ is symmetric, and the rate of convergence in this case is twice
that of the scheme used in (4). Of course, this gain in using (7) is only
achieved when the matrix $A$ is symmetric. An interesting motivation for
using the approximation (7) in general is furnished in Problem 2; this
result should be compared with that of Theorem 1.6.
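
The different behavior of the estimates (4) and (7) can be checked numerically. The following Python sketch is an illustration only: the 3 by 3 symmetric matrix, the starting vector, and the helper names `matvec`, `dot`, and `eigenvalue_estimates` are our own assumptions and do not come from the text. It runs the unscaled iteration (1) and prints the quotients of successive errors, which should be near $|\lambda_2/\lambda_1|$ for the ratio estimate and near $|\lambda_2/\lambda_1|^2$ for the Rayleigh quotient.

```python
def matvec(A, x):
    """Return A x for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    """Real inner product x* y."""
    return sum(xi * yi for xi, yi in zip(x, y))


def eigenvalue_estimates(A, x0, k, steps):
    """Unscaled iteration (1); returns the lists of estimates (4) and (7)."""
    x = list(x0)
    ratio, rayleigh = [], []
    for _ in range(steps):
        x_next = matvec(A, x)
        ratio.append(x_next[k] / x[k])               # sigma_{nu+1}, equation (4)
        rayleigh.append(dot(x, x_next) / dot(x, x))  # sigma'_{nu+1}, equation (7)
        x = x_next
    return ratio, rayleigh


# Illustrative symmetric matrix (our choice); its eigenvalues are
# 2 + sqrt(2), 2 and 2 - sqrt(2).
A = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]
lam1 = 2.0 + 2.0 ** 0.5
q = 2.0 / lam1                                       # |lambda_2 / lambda_1|

ratio, rayleigh = eigenvalue_estimates(A, [1.0, 0.0, 0.0], k=0, steps=20)
ratio_err = [abs(s - lam1) for s in ratio]
rayleigh_err = [abs(s - lam1) for s in rayleigh]

# Successive errors should shrink by about q for (4) and q**2 for (7).
print(ratio_err[-1] / ratio_err[-2], q)
print(rayleigh_err[-1] / rayleigh_err[-2], q ** 2)
```

Only twenty steps are taken because the iterates are not scaled here; for longer runs the scaling introduced in (13) would be used.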

In order to terminate the iterative computations (1) and (4) or (7), a
variety of different tests can be suggested. For instance, if the quotients
in (4) agree for several values of $k$ (i.e., ratios of several components)
then a fairly good approximation to $\lambda_1$ has been obtained. Usually the
obvious test of little or no change in the eigenvalue iterates $\sigma_\nu$ or $\sigma'_\nu$ for
several successive values of $\nu$ may be successfully employed. However,
a quantitative test, based on Theorems 1.8 or 1.9, requires little additional
computing and yields very precise information when $A$ is symmetric.
That is, pick an arbitrary $\epsilon > 0$ and iterate until

(10) $\|Ax_\nu - \sigma x_\nu\|_2 \le \epsilon \|Ax_\nu\|_2 = \epsilon \|x_{\nu+1}\|_2.$

Here $\sigma = \sigma_{\nu+1}$ or $\sigma'_{\nu+1}$. From Theorem 1.8, we then find

(11) $\min_j \left| 1 - \frac{\sigma}{\lambda_j} \right| \le \epsilon \|U\|_2 \cdot \|U^{-1}\|_2.$

For sufficiently large $\nu$ we can be assured that the minimum in (11) is
attained for $j = 1$ [assuming as usual that $a_1 \ne 0$ in (2)].

Thus, a bound on the relative error in approximating $\lambda_1$ is attained.
However, the quantities $\|U\|_2$ and $\|U^{-1}\|_2$ cannot be estimated in the
general case. But if $A$ is symmetric, the matrix $U$ is unitary and

$\|U\|_2 = \|U^{-1}\|_2 = 1.$

Thus for the symmetric case the condition (10) implies the precise bound

(12) $\left| 1 - \frac{\sigma}{\lambda_1} \right| \le \epsilon.$

The present iterative scheme usually requires that the iterates $x_\nu$ be
scaled at each step of the process. Thus in place of the sequence (1), we
actually calculate, with arbitrary $y_0$,

(13) $\eta_\nu = Ay_{\nu-1}, \quad y_\nu = \frac{1}{s_\nu}\eta_\nu; \quad \nu = 1, 2, \ldots.$

The sequence of scale factors, $s_\nu$, can be chosen in a variety of ways.
The two most commonly used choices are

(14a) $s_\nu = \max_j |\eta_{j,\nu}| = \|\eta_\nu\|_\infty,$

and

(14b) $s_\nu = \|\eta_\nu\|_2.$

The choice (14b) requires the taking of a square root at each step, but the
Rayleigh quotient estimate (7) of the eigenvalue becomes, since $\|y_\nu\|_2 = 1$,

$\sigma'_{\nu+1} = \frac{y_\nu^* A y_\nu}{y_\nu^* y_\nu} = y_\nu^* \eta_{\nu+1}.$

The ratio estimate (4) is now computed as

$\sigma_{\nu+1} = \frac{\eta_{k,\nu+1}}{y_{k,\nu}}.$

The convergence test (10) now takes the form

$\|\eta_{\nu+1} - \sigma y_\nu\|_2 \le \epsilon \|\eta_{\nu+1}\|_2.$

If the normalization (14b) is employed the computations for this test can
be simplified to

$(s_{\nu+1})^2 + \sigma^2 - 2\sigma\eta_{\nu+1}^* y_\nu \le \epsilon^2 (s_{\nu+1})^2.$

In any event, the convergence rates of (5) or (9) still apply in the
appropriate cases as do the error estimates (11) or (12).
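
The scaled iteration (13) with the normalization (14b) and the stopping test (10) may be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `power_method`, the tolerance, the step limit, and the test matrix are assumptions of ours, and the starting vector is normalized first so that the simplified Rayleigh quotient applies from the first step.

```python
import math


def matvec(A, x):
    """Return A x for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    """Real inner product x* y."""
    return sum(xi * yi for xi, yi in zip(x, y))


def power_method(A, y0, eps=1e-10, max_steps=1000):
    """Scaled power iteration (13) with s = ||eta||_2, i.e. choice (14b)."""
    norm0 = math.sqrt(dot(y0, y0))
    y = [yi / norm0 for yi in y0]
    for step in range(1, max_steps + 1):
        eta = matvec(A, y)                  # eta_{nu+1} = A y_nu
        sigma = dot(y, eta)                 # Rayleigh quotient, since ||y_nu||_2 = 1
        s = math.sqrt(dot(eta, eta))        # scale factor s_{nu+1}
        residual = math.sqrt(sum((e - sigma * yi) ** 2 for e, yi in zip(eta, y)))
        y = [e / s for e in eta]            # y_{nu+1}
        if residual <= eps * s:             # convergence test (10)
            return sigma, y, step
    raise RuntimeError("power method did not converge")


# Illustrative symmetric matrix (our choice) with principal eigenvalue 2 + sqrt(2).
A = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]
sigma, y, steps = power_method(A, [1.0, 0.0, 0.0])
print(sigma, y, steps)
```

Since the matrix is symmetric, the bound (12) applies and the returned `sigma` should have relative error no larger than `eps`.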

The power method as presented here is frequently adequate for
approximating a simple principal eigenvalue and eigenvector. In the event that
the principal eigenvalues are $\lambda_1 = \bar\lambda_2$, i.e., complex conjugate, but simple,
then the numbers $\lambda_1$, $\lambda_2$ will be approximated by the roots of a quadratic
equation found by examining three successive iterates $x_n$, $x_{n+1}$, and
$x_{n+2}$ (see Problem 6). Similarly, if the principal eigenvalue has a known
multiplicity, the scheme for approximating $\lambda_1$ can be suitably modified
(see Problem 7).

Of course, if the matrix $A$ has no zero components, the operational
count for each iteration (1) is $n^2$, in general. We now turn to modifications
of the power method which improve its rate of convergence.

2.1. Acceleration of the Power Method

Convergence to the principal eigenvalue by the power method has been
shown to be geometric with convergence factor $|\lambda_2/\lambda_1|$ (or by using the
Rayleigh quotient for a symmetric matrix, the eigenvalue convergence
factor is $|\lambda_2/\lambda_1|^2$). This ratio is frequently too near unity for practical
computations. Thus we seek modifications, analogous to those employed
in Chapter 2, Section 5, and Chapter 3, Subsection 3.3, to accelerate the
convergence.

Assume $A$ to be symmetric with unique principal eigenvalue $\lambda_1$ and
consider the power method for the modified real symmetric matrix

(15) $B = A + \alpha I.$

The eigenvalues $\mu_i$ of $B$ are

$\mu_i = \lambda_i + \alpha, \quad i = 1, 2, \ldots, n.$

If $\mu_1$ is the unique principal eigenvalue of $B$, the rate of convergence is now
determined by

$\max_{j \ne 1} \left| \frac{\lambda_j + \alpha}{\lambda_1 + \alpha} \right|.$

We minimize this ratio with respect to $\alpha$ and find, if the $\lambda_i$ are now ordered
by

$\lambda_1 > \lambda_2 \ge \cdots \ge \lambda_n$ (or $\lambda_1 < \lambda_2 \le \cdots \le \lambda_n$),

that the optimal value of $\alpha$ is

(16) $\alpha = -\frac{\lambda_2 + \lambda_n}{2}.$

The proof is a simple modification of that of Theorem 5.1 in Chapter 2.
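
The effect of the shift (16) can be seen on a small example. In the Python sketch below the eigenvalues, the grid of trial shifts, and the function names `convergence_factor` and `optimal_shift` are illustrative assumptions of ours; the sketch evaluates the convergence factor of the shifted matrix and compares the value at the shift (16) with the values on the grid.

```python
def convergence_factor(eigs, alpha):
    """max over j != 1 of |lambda_j + alpha| / |lambda_1 + alpha|.

    `eigs` must be ordered so that eigs[0] = lambda_1 > lambda_2 >= ... >= lambda_n.
    A value above 1 means lambda_1 + alpha is no longer the principal eigenvalue.
    """
    return max(abs(lam + alpha) for lam in eigs[1:]) / abs(eigs[0] + alpha)


def optimal_shift(eigs):
    """Equation (16): alpha = -(lambda_2 + lambda_n) / 2."""
    return -(eigs[1] + eigs[-1]) / 2.0


# Illustrative spectrum (our choice): 2 + sqrt(2), 2, 2 - sqrt(2).
eigs = [2.0 + 2.0 ** 0.5, 2.0, 2.0 - 2.0 ** 0.5]

alpha = optimal_shift(eigs)
plain = convergence_factor(eigs, 0.0)       # factor for A itself
shifted = convergence_factor(eigs, alpha)   # factor for B = A + alpha I

# Crude check on a grid of shifts in [-3, 3]: none should beat (16).
grid = [-3.0 + 0.01 * i for i in range(601)]
best_on_grid = min(convergence_factor(eigs, a) for a in grid)

print(plain, shifted, best_on_grid)
```

For this spectrum the factor drops from about 0.586 to 1/3, so roughly half as many iterations are needed for the same accuracy.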

In order to apply this improvement, estimates of $\lambda_2$ and $\lambda_n$ are required.
Such estimates may require auxiliary computations, which is one of the
disadvantages of the proposed acceleration procedure. For example,
if a crude estimate $\sigma$ of $\lambda_1$ is known (obtained perhaps by the ordinary
power method) then the principal eigenvalue of $C = A - \sigma I$ will be
$\lambda_n - \sigma$. Thus the power method applied to $C$ will yield an estimate of $\lambda_n$.
Good estimates of $\lambda_2$ are more difficult to compute. However, if $u$ is an
approximation to $u_1$, then using any $x_0$ such that $x_0^*u = 0$ as the initial
vector in (1) for a few iterations may yield a reasonable estimate of $\lambda_2$
(assuming, of course, that $|\lambda_2| > |\lambda_n|$). The reason is that, if $x_0$ is almost
orthogonal to $u_1$, then $a_1$ in (3) will be quite small. Whence, for
appropriately “small” values of $\nu$, we may have $|\lambda_1^\nu a_1| \ll |\lambda_2^\nu a_2|$ and the ratio
$x_{k,\nu+1}/x_{k,\nu}$ will be a better approximation to $\lambda_2$ than to $\lambda_1$.

On the other hand, if we are given $\alpha$ and $\beta$ such that, for example,

(17) $\alpha \le \lambda_n \le \cdots \le \lambda_2 \le \beta < \lambda_1,$

then more efficient acceleration schemes for finding $\lambda_1$ can be devised.
Let $P_m(\lambda)$ be a polynomial of degree $m$, say

(18) $P_m(\lambda) = c_0\lambda^m + c_1\lambda^{m-1} + \cdots + c_m.$

By $P_m(A)$ we indicate the matrix which is the corresponding polynomial
in the matrix $A$. In Problem 10, it is shown that the eigenvalues of $P_m(A)$
are $\{P_m(\lambda_i)\}$. We now consider the power method applied to the matrix
$P_m(A)$ with the initial vector $x_0$ of (2). In place of (3), we now get after $\tau$
iterations

(19) $x_\tau = P_m(A)x_{\tau-1} = P_m^\tau(A)x_0 = \sum_{j=1}^{n} P_m^\tau(\lambda_j) a_j u_j.$

To evaluate $x_\tau$ we do not actually form the matrix $P_m(A)$ but recursively
compute the vectors $Ax_{\tau-1}, \ldots, A^m x_{\tau-1}$. Thus the number of operations
performed in one iteration of (19) is equivalent to that for $m$ iterations of
(1). We then can compare the convergence rate by examining

$\max_{j \ne 1} \left| \frac{\lambda_j}{\lambda_1} \right|^m$ and $\max_{j \ne 1} \left| \frac{P_m(\lambda_j)}{P_m(\lambda_1)} \right|.$

In fact, the “best” polynomial (18) of degree $\le m$ to employ is that for
which $\max_z |P_m(z)/P_m(\lambda_1)|$ is a minimum on $\alpha \le z \le \beta$. This problem has
previously been met in Chapter 2, Section 5, in a similar context. The
determination of this best polynomial is described in Chapter 5, Section 4,
where we study the Chebyshev polynomials.

When the iterate $x_\tau$ has been computed the approximations $\sigma_{\tau+1}$ or
$\sigma'_{\tau+1}$ are formed as before by using $Ax_\tau$ and $x_\tau$. The convergence test (10)
may still be employed.

The convergence of the eigenvalue and eigenvector iterates can also be
improved by the $\delta^2$-process described in Chapter 3, Subsection 2.4 (see
Chapter 3, Problems 2.6, 2.7), when $A$ has a unique principal eigenvalue
and a complete set of eigenvectors (see Problem 5).


2.2. Intermediate Eigenvalues and Eigenvectors (Orthogonalization, 
Deflation, Inverse Iteration) 

In the previous section, we have indicated procedures whereby the power
method could be modified to obtain estimates of $\lambda_2$ and $\lambda_n$. The careful
application of these methods can be made to yield accurate approximations
to several eigenvalues and their eigenvectors. In principle, all the values
(of a real symmetric matrix) could be determined by these procedures,
but in practice much accuracy may be lost in the later stages of the process.

Assume that $A$ is symmetric, with principal eigenvalue $\lambda_1$, and that the
ordering is

$\lambda_1 > \lambda_2 \ge \cdots \ge \lambda_{n-1} > \lambda_n.$

The matrix $A - \sigma I$ has eigenvalues $\lambda_j - \sigma$ and, for real $\sigma$, has principal
eigenvalue either $\lambda_1 - \sigma$ or $\lambda_n - \sigma$ (since the above ordering is preserved).
If $\sigma > \lambda_1$ then $\lambda_n - \sigma$ is the principal eigenvalue of $A - \sigma I$ and by the
ordinary power method $\lambda_n$ and $u_n$ can be accurately approximated. (Note
here that Theorem 1.1 provides bounds for $\{\lambda_j\}$, whence such a $\sigma$ may be
obtained.)

However, this device cannot readily be used to yield the intermediate
eigenvalues.

The orthogonalization method is suitable for determining intermediate
eigenvalues and eigenvectors and will be described next. Once $u_1$ has been
accurately determined, we may form a vector $x_0$ which is orthogonal to $u_1$.
Such a vector has the eigenvector expansion (2) in which $a_1 = 0$. For
example, if $z$ is any vector then

(20) $x_0 = z - \frac{u_1^* z}{u_1^* u_1} u_1$

has the property, $x_0^* u_1 = 0$. Now the sequence $x_\nu$ is formed and used
to compute $\lambda_2$ (assuming $|\lambda_2| > |\lambda_j|$, $j = 3, 4, \ldots, n$) and $u_2$. However,
after several iterations, roundoff errors will usually introduce a small but
non-zero component of $u_1$ in the $x_\nu$ and subsequent iterations will magnify
this component. This contamination by roundoff may be reduced by
removing the $u_1$ component periodically; i.e., say after every $r$ steps $x_0$
is recomputed by using $x_r$ in place of $z$ in (20). When $\lambda_2$ and $u_2$ have been
determined in this manner the procedure can in principle be continued to
determine $\lambda_3$ and $u_3$.

In general, if $\lambda_i$ and $u_i$ are known for $i = 1, 2, \ldots, s$ then we form, for
arbitrary $z$,

(21) $x_0 = z - \sum_{i=1}^{s} \frac{u_i^* z}{u_i^* u_i} u_i.$

Since the $u_i$ are orthogonal, with $z = \sum_{j=1}^{n} a_j u_j$ we find that

$x_0 = a_{s+1}u_{s+1} + \cdots + a_n u_n,$

and the power method applied to this $x_0$ will yield $\lambda_{s+1}$ and $u_{s+1}$. The
roundoff contamination is more pronounced for larger values of $s$ and
hence (21) must be frequently reapplied with $z$ replaced by the current
iterate, $x_\nu$.
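
The orthogonalization method for a symmetric matrix may be sketched as follows. The sketch is illustrative only: the names `project_out` and `next_eigenpair`, the fixed number of steps, the choice to reapply (21) at every step, and the test matrix with its known eigenvector $u_1$ are all assumptions of ours. The iterates are scaled as in (13) and the eigenvalue is estimated by the Rayleigh quotient.

```python
import math


def matvec(A, x):
    """Return A x for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    """Real inner product x* y."""
    return sum(xi * yi for xi, yi in zip(x, y))


def project_out(z, known):
    """Equation (21): remove from z its components along the orthogonal vectors in `known`."""
    x = list(z)
    for u in known:
        coeff = dot(u, z) / dot(u, u)
        x = [xi - coeff * ui for xi, ui in zip(x, u)]
    return x


def next_eigenpair(A, known, z, steps=200, reorth_every=1):
    """Power iteration started from (21), with periodic re-orthogonalization."""
    y = project_out(z, known)
    norm = math.sqrt(dot(y, y))
    y = [yi / norm for yi in y]
    sigma = 0.0
    for step in range(1, steps + 1):
        eta = matvec(A, y)
        sigma = dot(y, eta)                     # Rayleigh quotient, ||y||_2 = 1
        if step % reorth_every == 0:
            eta = project_out(eta, known)       # remove roundoff contamination
        s = math.sqrt(dot(eta, eta))
        y = [e / s for e in eta]
    return sigma, y


# Illustrative symmetric matrix (our choice): eigenvalues 2 + sqrt(2), 2, 2 - sqrt(2);
# u1 is the eigenvector belonging to 2 + sqrt(2).
A = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]
u1 = [0.5 ** 0.5, 1.0, 0.5 ** 0.5]

lam2, u2 = next_eigenpair(A, [u1], [1.0, 0.0, 0.0])
print(lam2, u2)
```

Passing several known eigenvectors in `known` gives the general case of (21); a larger `reorth_every` trades accuracy for fewer projections.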

If the matrix $A$ were not symmetric, but did have a complete set of
eigenvectors, then the corresponding biorthogonalization process could
be carried out. For example, the unique principal left and right
eigenvectors $v_1$ and $u_1$ could be found: $u_1$ by the iteration scheme (1); and $v_1$
by the scheme, with $y_0$ arbitrary,

(1*) $y_\nu^* = y_{\nu-1}^* A, \quad \nu = 1, 2, \ldots.$

The unique next largest eigenvalue $\lambda_2$ could be found by selecting, for $z$
arbitrary,

(22a) $x_0 = z - \frac{v_1^* z}{v_1^* v_1} v_1,$

or

(22b) $y_0 = z - \frac{u_1^* z}{u_1^* u_1} u_1,$

and then applying the power methods (1) or (1*) respectively. The vectors
$\{x_n\}$ and $\{y_n\}$ are orthogonal to $v_1$ and $u_1$ respectively (see Problem 8).
Of course, in practice, the effect of rounding errors would have to be
removed by periodically re-biorthogonalizing the vectors $\{x_n\}$ and $\{y_n\}$.
To Problem 9, we leave the development of the analog of (21) for the
determination of $\lambda_{s+1}$, $u_{s+1}$, and $v_{s+1}$.

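A method for roughly approximating $\lambda_2$ and $u_2$ for the symmetric matrix
$A$ is motivated as follows. With (1), (2), and (3), we define

(23) $z_\nu = x_\nu - \lambda_1^\nu a_1 u_1 = \sum_{j=2}^{n} \lambda_j^\nu a_j u_j, \quad \nu = 1, 2, \ldots.$

By taking ratios of, say, the $k$th components of $z_\nu$ and $z_{\nu+1}$ we have,
assuming $|\lambda_2| > |\lambda_3| \ge \cdots \ge |\lambda_n|$,

$\frac{z_{k,\nu+1}}{z_{k,\nu}} = \lambda_2 + O\left( \left| \frac{\lambda_3}{\lambda_2} \right|^\nu \right).$

Similarly, by forming the Rayleigh quotient we have

$\frac{z_\nu^* z_{\nu+1}}{z_\nu^* z_\nu} = \lambda_2 + O\left( \left| \frac{\lambda_3}{\lambda_2} \right|^{2\nu} \right).$

Also as $\nu \to \infty$ the direction of $z_\nu$ converges to that of $u_2$.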

Unfortunately, the process defined by (23) with


and 

ih = lim x B , 

n-* oo 

has the feature that many significant figures are lost in (23) as $\nu$ gets large.
But these values $\sigma_{\nu+1}$ or $\sigma'_{\nu+1}$ are good approximations of $\lambda_2$ only for large
$\nu$. Hence we must choose $\nu$ judiciously in order to get a reasonable estimate
of $\lambda_2$ and $u_2$. This procedure can be readily adapted for the sequence of
normalized iterates $\{y_\nu\}$ defined in (13). It should be noted that the vectors
$\{x_\nu\}$ (or $\{y_\nu\}$) which are required have already been computed in determining
$\lambda_1$ and $u_1$. Thus, with little extra computation, we have found a rough
approximation for $\lambda_2$ and $u_2$.

These two procedures removed components of the known eigenvectors
from the iterated vectors. However, it is also possible to alter the real
symmetric matrix $A$ so that the known eigenvectors then correspond to
zero eigenvalues. Iteration on an arbitrary vector with this altered matrix
then automatically eliminates the known components. Thus, suppose
$u_1$ and $\lambda_1$ are known and that $u_1$ is normalized by $\|u_1\|_2^2 = u_1^*u_1 = 1$.
Then we may form the matrix $A_1$ as follows

(24) $A_1 = A - \lambda_1 u_1 u_1^*.$

Since $A$ is symmetric, so is $A_1$. We also note that $A_1u_1 = \mathbf{o}$. For any other
eigenvector, $u_j$, belonging to an eigenvalue, $\lambda_j$, $j > 1$, it follows from the
orthogonality of the eigenvectors that $A_1u_j = \lambda_ju_j$. Thus $A_1$ has all the
eigenvectors of $A$ and all its eigenvalues except $\lambda_1$ which is replaced by
zero. A simple calculation and proof by induction reveals that (see Problem
10)

(25) $z_\nu = A_1^\nu z = A^\nu z - \lambda_1^\nu u_1u_1^*z = A^\nu z - \lambda_1^\nu (u_1^*z)u_1.$

A comparison of this result and the sequence $\{x_\nu\}$ with $x_0$ given by (20)
shows that, for exact computation with normalized eigenvectors, the present
method is exactly equivalent to the orthogonalization method. The
cancellation errors of (23) do not occur now but instead an error grows due to
the fact that the $\lambda_1$ and $u_1$ employed are not an exact eigenvalue and
eigenvector respectively. Thus the computed $A_1$ does not satisfy (25) exactly.
However, iterations with $A_1$ are usually more accurate than the more
economical computations in (23).

If the matrix $A$ were sparse (many zero elements), then we would not
recommend using (24) since it would, in general, produce a full matrix $A_1$.
When $\lambda_2$ and $u_2$ have been determined from $A_1$, the procedure can be
repeated by forming $A_2 = A_1 - \lambda_2u_2u_2^*$, etc., to determine the remaining
eigenvalues and eigenvectors.

In the above modifications of the matrix $A$, the resulting matrices $A_k$
are still of order $n$. It is, in principle, possible to successively alter the
matrix $A$, so that the resulting matrices $B_k$ are of order $n - k$, $k = 1,
2, \ldots, n - 1$, and have the same remaining eigenvalues. Such procedures
are called deflation methods, by analogy with the process of dividing out
the roots of a polynomial as they are successively found. For example,
the simplest such scheme is based on the method used in Theorem 1.1 of
Chapter 1 to show that every matrix is similar to a triangular matrix.
(The deflated matrices are the $A_k$ defined there.)
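
The matrix (24) and the identity (25) are easy to check numerically. In the sketch below the helper names `deflate` and `apply_power`, the test matrix, its eigenpair, and the trial vector `z` are our own illustrative assumptions; the eigenvector is normalized so that $u_1^*u_1 = 1$ as required.

```python
def matvec(A, x):
    """Return A x for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(x, y):
    """Real inner product x* y."""
    return sum(xi * yi for xi, yi in zip(x, y))


def deflate(A, lam, u):
    """Equation (24): return A - lam * u u^T for a real unit vector u."""
    n = len(A)
    return [[A[i][j] - lam * u[i] * u[j] for j in range(n)] for i in range(n)]


def apply_power(A, z, nu):
    """Return A^nu z by repeated multiplication."""
    x = list(z)
    for _ in range(nu):
        x = matvec(A, x)
    return x


# Illustrative symmetric matrix (our choice) with a known principal eigenpair.
A = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]
lam1 = 2.0 + 2.0 ** 0.5
u1 = [0.5, 0.5 ** 0.5, 0.5]          # unit eigenvector for lam1
u2 = [1.0, 0.0, -1.0]                # eigenvector for the eigenvalue 2

A1 = deflate(A, lam1, u1)
A1_u1 = matvec(A1, u1)               # should be the zero vector
A1_u2 = matvec(A1, u2)               # should equal 2 * u2

# Identity (25) for nu = 3 and an arbitrary z.
z, nu = [1.0, 2.0, -1.0], 3
lhs = apply_power(A1, z, nu)
c = lam1 ** nu * dot(u1, z)
rhs = [a - c * ui for a, ui in zip(apply_power(A, z, nu), u1)]
print(A1_u1, A1_u2, lhs, rhs)
```

Note that `A1` is a full matrix even though `A` is tridiagonal, which is the drawback mentioned above for sparse matrices.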

We now describe another scheme, which has the additional feature that
when the matrix $A$ is Hermitian, the deflated matrices are also Hermitian.
Let $Au = \lambda u$, $u^*u = 1$, $u_1 > 0$; set

(26) $P = I - 2\omega\omega^*,$

where $\omega$ is defined, with $e_1 = (\delta_{i1})$, by the properties

(27) $\omega = (\omega_i), \quad \omega^*\omega = 1, \quad \omega_1 > 0, \quad Pe_1 = u.$

In Problems 11 and 12, we show that $P$ is unitary and that the components
$(\omega_i)$ of $\omega$ are defined by

(28) $\omega_1^2 = \frac{1 - u_1}{2}, \quad \omega_k = -\frac{u_k}{2\omega_1}, \quad k = 2, 3, \ldots, n.$

Now it is easy to see that

$APe_1 = \lambda Pe_1,$

whence

(29) $P^{-1}APe_1 = \lambda e_1.$

Equation (29) shows that $e_1$ is an eigenvector of

(30) $B_1 = P^{-1}AP = PAP.$

Therefore, the first column of $B_1$ is $\lambda e_1$. In other words, $A$ has been
deflated.† This process can be continued with the matrix $A_1$ of order $n - 1$
consisting of the last $n - 1$ rows and columns of $B_1$.

† In practice the evaluation of $B_k$ could be performed economically by adapting the
procedure described in equations (3.16*).

Let the matrix, $Q$,
of order $n - 1$ be of the form (26), and satisfy (27) and (28) relative to
the matrix $A_1$ with an eigenvalue $\mu$ and eigenvector $v$ (of order $n - 1$).
Then set



$P_1 = \begin{pmatrix} 1 & \mathbf{o}^* \\ \mathbf{o} & Q \end{pmatrix},$

where $\mathbf{o}$ is of order $n - 1$. It is easy to verify that $P_1$ is unitary, in fact,
$P_1^{-1} = P_1^* = P_1$. Hence the matrix

$B_2 = P_1PAPP_1 = \begin{pmatrix} \lambda & a_2 & a_3 & \cdots & a_n \\ 0 & \mu & b_3 & \cdots & b_n \\ 0 & 0 & & & \\ \vdots & \vdots & & A_2 & \\ 0 & 0 & & & \end{pmatrix},$

where $A_2$ is of order $n - 2$. This process may be continued to provide a
proof of

THEOREM 1 (SCHUR). The matrix $A$, of order $n$, is unitarily similar to a
triangular matrix.

Proof. Left as Problem 13. ■

COROLLARY. The Hermitian matrix $A$ is unitarily similar to a diagonal
matrix. ■
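
One step of this deflation may be sketched for a real symmetric matrix as follows. The formulas used for the components of $\omega$ are those obtained by writing out the condition $Pe_1 = u$ of (27) componentwise; the function names, the test matrix, and its eigenpair are our own illustrative assumptions, and the sketch assumes $u_1 < 1$ so that $\omega_1 \neq 0$.

```python
import math


def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def reflector_from_unit_vector(u):
    """Return P = I - 2 w w^T with P e_1 = u, for a real unit vector u with u[0] < 1."""
    w1 = math.sqrt((1.0 - u[0]) / 2.0)              # from 1 - 2 w_1^2 = u_1
    w = [w1] + [-uk / (2.0 * w1) for uk in u[1:]]   # from -2 w_1 w_k = u_k
    n = len(u)
    return [[(1.0 if i == j else 0.0) - 2.0 * w[i] * w[j] for j in range(n)]
            for i in range(n)]


# Illustrative symmetric matrix (our choice): A (1,1,1)^T = 3 (1,1,1)^T,
# and the remaining eigenvalues are 2 and 0.
A = [[2.0, 1.0, 0.0],
     [1.0, 1.0, 1.0],
     [0.0, 1.0, 2.0]]
lam = 3.0
u = [1.0 / math.sqrt(3.0)] * 3

P = reflector_from_unit_vector(u)
B1 = matmul(P, matmul(A, P))                         # equation (30), P^{-1} = P

first_column = [B1[i][0] for i in range(3)]          # should be lam * e_1
block = [row[1:] for row in B1[1:]]                  # deflated matrix of order n - 1
block_trace = block[0][0] + block[1][1]
block_det = block[0][0] * block[1][1] - block[0][1] * block[1][0]
print(first_column, block)
```

The trace and determinant of the 2 by 2 block should equal the sum and the product of the remaining eigenvalues of `A`.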

Finally, we describe another iteration scheme for determining the
intermediate eigenvalues and eigenvectors. This procedure is called inverse
iteration and is based upon solving

(31) $(A - \sigma I)x_n = x_{n-1}, \quad n = 1, 2, \ldots,$

with $x_0$ arbitrary and $\sigma$ a constant.

Clearly, (31) is equivalent to the power method for the matrix $(A - \sigma I)^{-1}$.
We may use Gaussian elimination and (31) to calculate $x_n$.

Of course, the procedure will produce the principal eigenvalue of
$(A - \sigma I)^{-1}$, i.e., $1/(\lambda_k - \sigma)$, where

$|\lambda_k - \sigma| = \min_j |\lambda_j - \sigma|,$

provided that $\sigma$ is closer to the simple eigenvalue $\lambda_k$ of $A$ than to any
other eigenvalue of $A$. Each iteration step, after the first triangularization
of $A - \sigma I$, requires about $n^2$ operations. The first iteration step requires
about $n^3/3$ operations [see Chapter 2, equation (1.9)]. The inverse
iteration method is very useful for improving upon the accuracy of an
approximation to any eigenvalue.

Other iteration schemes based on matrix transformations have been
devised to approximate the normal forms given in Theorem 1 and the
corollary. We treat such schemes in the next section.
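
Inverse iteration with a single triangularization of $A - \sigma I$ may be sketched as follows. The routines `lu_factor` and `lu_solve` below are our own small implementations of Gaussian elimination with partial pivoting (not library calls), and the scaling of the iterates, the Rayleigh quotient used to estimate $1/(\lambda_k - \sigma)$, the fixed step count, and the test matrix are likewise assumptions made for illustration.

```python
import math


def dot(x, y):
    """Real inner product x* y."""
    return sum(xi * yi for xi, yi in zip(x, y))


def lu_factor(M):
    """Gaussian elimination with partial pivoting; returns (LU, perm)."""
    n = len(M)
    lu = [row[:] for row in M]
    perm = list(range(n))
    for k in range(n - 1):
        p = max(range(k, n), key=lambda r: abs(lu[r][k]))
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]                 # multiplier stored in place
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve M x = b from the factorization of M; about n^2 operations."""
    n = len(lu)
    y = [b[p] for p in perm]
    for i in range(n):                           # forward substitution (unit lower part)
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    for i in reversed(range(n)):                 # back substitution
        for j in range(i + 1, n):
            y[i] -= lu[i][j] * y[j]
        y[i] /= lu[i][i]
    return y


def inverse_iteration(A, sigma, x0, steps=20):
    """Iteration (31) with scaling; returns the estimate of lambda_k and the iterate."""
    n = len(A)
    shifted = [[A[i][j] - (sigma if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    lu, perm = lu_factor(shifted)                # triangularize once
    x = list(x0)
    mu = 0.0
    for _ in range(steps):
        z = lu_solve(lu, perm, x)                # (A - sigma I) z = x
        mu = dot(x, z) / dot(x, x)               # tends to 1 / (lambda_k - sigma)
        s = math.sqrt(dot(z, z))
        x = [zi / s for zi in z]
    return sigma + 1.0 / mu, x


# Illustrative symmetric matrix (our choice): eigenvalues 2 + sqrt(2), 2, 2 - sqrt(2).
A = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]
lam_k, x_k = inverse_iteration(A, 1.9, [1.0, 0.0, 0.0])
print(lam_k, x_k)
```

With the shift 1.9 the iteration picks out the intermediate eigenvalue 2, which the plain power method cannot reach directly.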


PROBLEMS, SECTION 2 

1. Prove that a real square matrix A of order n is symmetric iff it has n 
orthogonal eigenvectors. 

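2. Let $A$, $y$, and $\sigma$ be real. Show that the scalar $\sigma$ such that $\sigma y$ is the best
mean square approximation to the vector $Ay$ is given by

$\sigma = \frac{1}{2} \frac{y^*(A + A^*)y}{y^*y} = \frac{y^*Ay}{y^*y}.$

[Hint: $(y^*A^*y) = (y^*Ay)^*$; $\|z\|_2^2 = z^*z$.]

3. Show that if $A$ is Hermitian, then with $A = S + iK$, where $S$ is real
symmetric, $K$ real skew-symmetric, the eigenvector $u = x + iy$ and eigenvalue
$\lambda$ satisfy

$\begin{pmatrix} S & -K \\ K & S \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \lambda \begin{pmatrix} x \\ y \end{pmatrix}.$

Verify that if $\lambda$ is a simple eigenvalue of $A$, then $\lambda$ is a double eigenvalue of the
compound matrix.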

4. The power method for computing left eigenvectors is based on the
sequence

$z_\nu^* = z_{\nu-1}^*A = z_0^*A^\nu, \quad \nu = 1, 2, \ldots.$

Then, with the use of the sequence (1), we may approximate $\lambda_1$ by

$\sigma''_{\nu+1} = \frac{z_\nu^* A x_\nu}{z_\nu^* x_\nu}.$

Show that, if the matrix $A$ has a complete set of eigenvectors and a unique
principal eigenvalue, then $\sigma''_\nu$ converges with the ratio $|\lambda_2/\lambda_1|^2$. (Note, however,
that twice as many computations are required to evaluate each iteration step
here.)

5. Use Problems 2.6 and 2.7 of Chapter 3 to find the improvement in
eigenvalues and eigenvectors obtained by applying the $\delta^2$-process to $\sigma_{\nu+1}$ of (4)
or $\sigma'_{\nu+1}$ of (7), when $A$ is real symmetric and has a unique principal eigenvalue.

6. Show how to find the coefficients $s$, $p$ of the quadratic equation
$\lambda^2 - s\lambda + p = 0$, that is satisfied by the unique complex conjugate pair of
simple principal eigenvalues $\lambda_1$, $\lambda_2$ of the real matrix $A$.

[Hint: Given $\lambda_1 = \bar\lambda_2$; $|\lambda_1| = |\lambda_2| > |\lambda_j|$, $j = 3, 4, \ldots, n$; assume that for
the corresponding eigenvectors $u_1$ and $u_2 = \bar u_1$, the respective first components
$u_{11}$ and $u_{12}$ are maximal. Then apply the technique of Bernoulli’s method,
Chapter 3, equations (4.16)-(4.18).]

7. When the maximal eigenvalue has multiplicity $m$, how can the power
method be used? (See Chapter 3, Problem 4.5.)

8. Verify that the sequences $\{x_n\}$ and $\{y_n\}$ defined by (1), (1*), and (22) are
orthogonal to $v_1$ and $u_1$ respectively. That is, show

$v_1^*x_n = u_1^*y_n = 0.$

[Hint: Use induction with respect to $n$.]

9. Develop the biorthogonal analog of (21), for the case that $A$ has a
complete set of eigenvectors. Describe the power method for the determination
of the unique intermediate eigenvalues when $|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n|$.

[Hint: Generalize (22) and use Problem 8.]

10. Prove (25), i.e., if $A$ is real symmetric, $Au = \lambda u$, and $\|u\|_2 = 1$, then for
all $z$ and $\nu = 1, 2, \ldots$,

$(A - \lambda uu^*)^\nu z = A^\nu z - \lambda^\nu (u^*z)u.$

11. Prove that the Hermitian matrix

$P = I - 2\omega\omega^*$

is unitary, in fact $P^{-1} = P^* = P$, iff

$\omega^*\omega = 1.$

12. With

$u_1 > 0, \quad u^*u = 1,$

show that if the matrix $P$ of (26) satisfies (27), then

$\omega_1^2 = \frac{1 - u_1}{2}, \quad \omega_k = -\frac{u_k}{2\omega_1}, \quad k = 2, 3, \ldots, n.$



13. (a) Give a complete proof by induction on n of Theorem 1 (Schur), 
along the line indicated in text. 

(b) Give another proof of Theorem 1 by making use of Theorem 1.1 
of Chapter 1. 

[Hint: Construct $B = P^{-1}AP$ where $B$ is triangular. Since $P$ is non-singular,
construct a matrix $Q$ whose columns are an orthonormal set of vectors formed
successively from the columns of $P$ so that $P = QR$, where $R$ is upper
triangular and non-singular. Therefore $RBR^{-1} = Q^{-1}AQ$. Show that the
product of two upper triangular matrices and the inverse of an upper triangular
matrix are upper triangular matrices.]



3. METHODS BASED ON MATRIX TRANSFORMATIONS 

The methods of Section 2 are suitable for the determination of a few
eigenvalues and eigenvectors of the matrix $A$. However, when we are
interested in finding all of the eigenvalues and eigenvectors then the
methods based on transforming the matrix $A$ seem more efficient. Attempts
to implement Schur’s theorem, by approximating a unitary matrix $H$
which triangularizes by similarity the given general matrix $A$, have been
recently made. We shall describe one of these later. In the case of a
Hermitian matrix $A$, however, the classical Jacobi’s method does work well to
produce a unitarily similar diagonal matrix. The methods of Givens and
Householder produce either a similar tridiagonal matrix for a Hermitian
$A$ or a similar matrix of Hessenberg† form for a general matrix $A$, with the
use of a unitary similarity transformation.

We will first describe Jacobi’s, Givens’, and Householder’s methods for
treating the Hermitian matrix $A$. In this case, since the computational
problem in practice is usually reduced to that of finding the eigenvalues
of a real symmetric matrix (see Problem 2.3), we will assume that $A$ is
real symmetric. All of these methods, however, require only simple
modifications to be directly applicable to the Hermitian case. Jacobi’s
method reduces the real symmetric matrix $A$ by an infinite sequence of
simple orthogonal similarity transformations (two dimensional rotations)
to diagonal form. The following lemmas provide the basis of the procedure.

LEMMA 1. Let $B = P^{-1}AP$. If $P$ is orthogonal (unitary), then

(1) $\operatorname{trace}(B) = \operatorname{trace}(A), \quad \operatorname{trace}(B^*B) = \operatorname{trace}(A^*A).$

Proof. Since $B$ and $A$ are similar matrices, the eigenvalues of $B$ are
the same as the eigenvalues of $A$. By definition,

$\operatorname{trace}(A) = \sum_{i=1}^{n} a_{ii}.$

The eigenvalues of $A$ are the roots of

$p_A(\lambda) = \det(\lambda I - A) = 0.$

By partially expanding the determinant, the coefficient of $\lambda^n$ in $p_A(\lambda)$
is seen to be unity while the coefficient of $\lambda^{n-1}$ is $-\operatorname{trace}(A)$. Hence, we
find that

$\operatorname{trace}(A) = \sum_{i=1}^{n} \lambda_i = \operatorname{trace}(B).$

The orthogonality of $P$, i.e., $P^* = P^{-1}$, implies

$B^*B = (P^*A^*P)(P^*AP) = P^*A^*AP.$

Again, because the eigenvalues are unchanged under a similarity
transformation, the trace of $A^*A$ is unchanged. ■

† See the definition of (upper) Hessenberg form in Problem 1.7 of Chapter 2.




LEMMA 2. Let

$A_{ij} = \begin{pmatrix} a_{ii} & a_{ij} \\ a_{ji} & a_{jj} \end{pmatrix}, \quad a_{ij} \ne 0,$

be formed from elements of a real symmetric matrix $A$ and let

(2a) $R = \begin{pmatrix} r_{ii} & r_{ij} \\ r_{ji} & r_{jj} \end{pmatrix}$

with $r_{ii} = r_{jj} = \cos\phi$, $r_{ij} = -r_{ji} = \sin\phi$, where

(2b) $\tan 2\phi = \frac{2a_{ij}}{a_{jj} - a_{ii}}.$

Then

$R^* = R^{-1}$

and

$B_{ij} = R^*A_{ij}R$

is a diagonal matrix.

Proof. Let

$B_{ij} = \begin{pmatrix} b_{ii} & b_{ij} \\ b_{ji} & b_{jj} \end{pmatrix}.$

Equation (2), by Problem 1, guarantees that $b_{ij} = b_{ji} = 0$. ■

By Lemma 1, since $A_{ij}$ and $B_{ij}$ are orthogonally similar,

(3) $b_{ii}^2 + b_{jj}^2 = a_{ii}^2 + 2a_{ij}^2 + a_{jj}^2.$

We say that the matrix $R$ reduced to zero the element $a_{ij}$. We now
construct an orthogonal matrix $P$ of order $n$, to reduce to zero the element
$a_{ij}$ of $A$. Let

(4) $P \equiv (p_{st})$

where

$p_{ii} = r_{ii}, \quad p_{ij} = r_{ij}, \quad p_{ji} = r_{ji}, \quad p_{jj} = r_{jj},$

and

$p_{st} = \delta_{st}$ otherwise.

With the elements $r_{ii}$, $r_{ij}$, $r_{ji}$, $r_{jj}$ defined in (2), $P^*P = PP^* = I$.
We call such a matrix $P$ a two dimensional rotation. Now set,

$B \equiv P^*AP = (b_{st}).$

Clearly, $b_{ij} = b_{ji} = 0$. Hence the matrix $P$ reduces to zero the element
$a_{ij}$ of $A$. We now can show that $P$ reduces the sum of the squares of the
off-diagonal elements of $A$. From the definition of $\operatorname{trace}(A^*A)$, it is easy
to verify that

$\operatorname{trace}(A^*A) = \sum_{r,s=1}^{n} a_{rs}^2.$
t. / = 1 

Now, by Lemmas 1 and 2 and equation (3), we find, since $b_{kk} = a_{kk}$ for
$k \ne i, j$,

(5) $\sum_{r \ne s} b_{rs}^2 = \operatorname{trace}(B^*B) - \sum_{s=1}^{n} b_{ss}^2 = \operatorname{trace}(A^*A) - \sum_{s=1}^{n} a_{ss}^2 - 2a_{ij}^2 = \sum_{r \ne s} a_{rs}^2 - 2a_{ij}^2.$

If we pick $(i, j)$ so that $a_{ij}^2 = \max_{k \ne s} \{a_{ks}^2\}$, then

(6) $a_{ij}^2 \ge \frac{\sum_{k \ne s} a_{ks}^2}{n^2 - n} = \frac{\operatorname{trace}(A^*A) - \sum_{k=1}^{n} a_{kk}^2}{n^2 - n}.$

In fact, (6) is satisfied by any $a_{ij}^2$ that is not less than the average of the
squares of the off-diagonal elements of $A$.

By substituting the inequality (6) in (5), we have

(7) $\sum_{r \ne s} b_{rs}^2 \le \left( 1 - \frac{2}{n^2 - n} \right) \sum_{r \ne s} a_{rs}^2.$

Therefore, each two dimensional rotation defined in (2) and (4), and such
that (6) holds, reduces the sum of the squares of the off-diagonal elements
of a symmetric matrix $A$ by a factor not greater than

$1 - \frac{2}{n^2 - n} < 1 \quad$ for $n \ge 2$.

In addition, we observe from (3) and the phrase preceding (5) that

(8) $\sum_{s=1}^{n} b_{ss}^2 = \sum_{s=1}^{n} a_{ss}^2 + 2a_{ij}^2.$

We therefore have the basis for proving

THEOREM 1 (JACOBI). Let $A$ be real symmetric, with eigenvalues $\{\lambda_i\}$;
let the matrices $\{P_m\}$ be two dimensional rotation matrices of the form (4)
defined successively so that an above-average element of

$B_0 = A$

is reduced to zero by $P_1$, and thereafter $P_m$ reduces to zero an above-average
element of $B_{m-1}$ in the similarity transformation defining $B_m$,

$B_m = P_m^*B_{m-1}P_m, \quad m = 1, 2, \ldots.$

Let

(9) $Q_m = P_1P_2 \cdots P_m.$

Then, as $m \to \infty$,

$Q_m^*AQ_m \to \Lambda,$

where $\Lambda = (\lambda_i\delta_{ij})$, for some ordering of the $\{\lambda_i\}$.

Proof. For $m$ sufficiently large, by (7) and Gerschgorin’s theorem

$B_m \equiv Q_m^*AQ_m$

is approximately the diagonal matrix $\Lambda$, for some ordering of $\{\lambda_i\}$. Now,
the angle of rotation $\phi$, determined for $P_{m+1}$ from $B_m$, is close to zero,
unless the two diagonal elements of $B_m$ used to determine $\phi$ correspond to
a pair of identical eigenvalues. Hence it is easy to verify that

$Q_m^*AQ_m \to \Lambda.$ ■

Jacobi's method is the scheme in which $P_{k+1}$ is determined so as to
reduce a maximal off-diagonal element of $B_k$ to zero, $k = 0, 1, 2, \ldots$.
In practice, by listing the magnitude and column index of the maximal
off-diagonal element in each row of $B_k$, the Jacobi scheme is easily carried
out. That is, the only elements that change in going from $B_k$ to $B_{k+1}$ are the
elements in the rows and columns of index $i$ or $j$. Hence, the list of maximal
elements in each row of $B_{k+1}$ is easily constructed from the list for $B_k$
by making at most $2n - 1$ comparisons. A common variation of the Jacobi
method consists in examining the off-diagonal elements of $B_m$ in a
systematic cyclic order given by the indices $(1, 2), (1, 3), \ldots, (1, n), (2, 3),
(2, 4), \ldots, (2, n), (3, 4), \ldots, (n - 1, n)$. The indices $(i, j)$ used to determine
$P_{m+1}$ correspond to the first element $b_{ij}^{(m)}$ of $B_m$ that satisfies

$|b_{ij}^{(m)}| \ge t_m,$

where $\{t_m\}$ is a prescribed decreasing sequence of positive numbers called
thresholds. Such an iteration procedure is called a threshold scheme.
A bolder approach, namely to rotate the off-diagonal elements in sequence,
irrespective of size, is called the cyclic Jacobi scheme. Surprisingly enough,
with only a minor change in the definition of the angle $\phi$, when $\phi \approx \pm\pi/4$,
the cyclic Jacobi scheme has been shown (Forsythe-Henrici) to be
convergent also. In fact, if a comparison of the residual off-diagonal sum of
squares, $\sum_{r \ne s} (b_{rs}^{(m)})^2$, is made after each complete cycle of $q = (n^2 - n)/2$
rotations, then it has been shown that the cyclic Jacobi scheme converges
and, in fact, that it converges quadratically, i.e.,

$\sum_{r \ne s} (b_{rs}^{(m+q)})^2 \le K \left[ \sum_{r \ne s} (b_{rs}^{(m)})^2 \right]^2$

for $m$ large enough.

In these Jacobi schemes, the eigenvalues of $A$ are approximated by the
diagonal elements of $B_m$, for sufficiently large $m$. Furthermore, the
corresponding eigenvectors of $B_m$ are then approximately the unit vectors $\{e_j\}$.
Hence, the eigenvectors of $A$ are approximated by the columns of

$Q_m = P_1P_2 \cdots P_m.$

Since only two columns of $Q_m$ are changed in going from $Q_m$ to $Q_{m+1}$,
only $4n$ multiplications are involved in this step. On the other hand, the
elements of the matrix $B_m$ that are unaffected by the rotation $P_{m+1}$ are
those that have indices $(r, s)$ with $r \ne i, j$ and $s \ne i, j$. Therefore, because
of symmetry, approximately $4n$ multiplications are needed to carry out
the rotation of $B_m$ into $B_{m+1}$ (if we neglect to count the square root
operations necessary to determine $\cos\phi$ and $\sin\phi$). Hence, about $8n$
multiplications are required to determine $B_{m+1}$ and $Q_{m+1}$ from $B_m$ and $Q_m$.
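
The classical Jacobi scheme may be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name `jacobi_eigen`, the tolerance, the rotation limit, and the test matrix are ours; the maximal off-diagonal element is found by a full search rather than by the row lists described above; the angle is taken from (2b) with $|\phi| \le \pi/4$; and symmetry is not exploited to save multiplications.

```python
import math


def jacobi_eigen(A, tol=1e-12, max_rotations=10000):
    """Classical Jacobi method for a real symmetric matrix of order n >= 2.

    Returns the diagonal of B_m and the matrix Q_m = P_1 P_2 ... P_m.
    """
    n = len(A)
    B = [row[:] for row in A]
    Q = [[1.0 if r == t else 0.0 for t in range(n)] for r in range(n)]
    for _ in range(max_rotations):
        # Maximal off-diagonal element (full search, for simplicity).
        i, j = max(((r, t) for r in range(n) for t in range(r + 1, n)),
                   key=lambda rt: abs(B[rt[0]][rt[1]]))
        if abs(B[i][j]) <= tol:
            break
        # Angle from (2b): tan(2 phi) = 2 b_ij / (b_jj - b_ii), |phi| <= pi/4.
        if B[j][j] == B[i][i]:
            phi = math.copysign(math.pi / 4.0, B[i][j])
        else:
            phi = 0.5 * math.atan(2.0 * B[i][j] / (B[j][j] - B[i][i]))
        c, s = math.cos(phi), math.sin(phi)
        # P has p_ii = p_jj = c, p_ij = s, p_ji = -s; form B P and Q P.
        for M in (B, Q):
            for r in range(n):
                mi, mj = M[r][i], M[r][j]
                M[r][i], M[r][j] = c * mi - s * mj, s * mi + c * mj
        # Complete the similarity transformation: P^T (B P).
        for t in range(n):
            bi, bj = B[i][t], B[j][t]
            B[i][t], B[j][t] = c * bi - s * bj, s * bi + c * bj
    return [B[k][k] for k in range(n)], Q


# Illustrative symmetric matrix (our choice): eigenvalues 2 - sqrt(2), 2, 2 + sqrt(2).
A = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]
eigs, Q = jacobi_eigen(A)
print(sorted(eigs))
```

The columns of the returned `Q` approximate the eigenvectors, in the same order as the entries of `eigs`.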

We now consider the Givens transformation. Here a finite sequence of

$M = \frac{(n - 2)(n - 1)}{2}$

rotations are employed to reduce the real symmetric matrix $A$ to
tridiagonal form.

That is, we successively construct a sequence of matrices $\{P_k\}$, $k = 1,
2, \ldots, M$, and define

(10) $B_0 = A, \quad B_k = P_k^*B_{k-1}P_k, \quad 1 \le k \le M.$

The matrices $\{P_k\}$ are two dimensional rotations of the form (4)
constructed so that the first $k$ not-codiagonal elements of $B_k$ are zero. That is,
we say $\{a_{ii}\}$ and $\{a_{i,i+1}\}$ are the codiagonal elements of $A$. For a symmetric
matrix, we refer to the not-codiagonal indices listed cyclically in the order

(11) $(1, 3), (1, 4), \ldots, (1, n), (2, 4), (2, 5), \ldots, (2, n), \ldots, (n - 2, n).$

The first $k$ not-codiagonal elements of $B_k$ are the elements of $B_k$ whose
indices are among the first $k$ in the list (11). The facts are summarized in

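THEOREM 2 (GIVENS). Let $A$ be real symmetric; $B_0 \equiv A$. Let $(i-1, j)$
be the $k$th pair of indices in the cyclic sequence (11), and $B_{k-1}$ have
elements $(b_{rs}^{(k-1)})$. Let $P_k \equiv I$, if $b_{i-1,j}^{(k-1)} = 0$;
otherwise, set $P_k \equiv (p_{rs}^{(k)})$ with

(12)    $$p_{ii}^{(k)} = p_{jj}^{(k)} = \frac{b_{i-1,i}^{(k-1)}}{\sqrt{\bigl(b_{i-1,i}^{(k-1)}\bigr)^2 + \bigl(b_{i-1,j}^{(k-1)}\bigr)^2}},$$
        $$-p_{ij}^{(k)} = p_{ji}^{(k)} = \frac{b_{i-1,j}^{(k-1)}}{\sqrt{\bigl(b_{i-1,i}^{(k-1)}\bigr)^2 + \bigl(b_{i-1,j}^{(k-1)}\bigr)^2}},$$
        $$p_{rs}^{(k)} = \delta_{rs} \quad \text{for other } (r, s).$$

Let the matrices $\{B_k\}$ and $\{P_k\}$ be defined by (10) and (12) for
$k = 1, 2, \ldots, M = (n-2)(n-1)/2$. Then, $B_k$ is real symmetric; the first
$k$ not-codiagonal elements of $B_k$ are zero, $k = 1, 2, \ldots, M$; $B_M$ is
tridiagonal.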

Proof. It is a simple matter to verify that if the first $k - 1$
not-codiagonal elements of $B_{k-1}$ are zero, then the corresponding elements
of $B_k$ also vanish. [This property of preserving zeros is not valid in
general for the rotations of the type used in Jacobi's method! In Jacobi's
method the matrix $P_k$, a two dimensional rotation in the $(i, j)$
coordinates, annihilated the $(i, j)$ element; but in Givens' method the
matrix $P_k$ annihilates the $(i-1, j)$ element.] Furthermore, by Problem 2,
the definition (12) of $P_k$ ensures that
$b_{i-1,j}^{(k)} = b_{j,i-1}^{(k)} = 0$. Hence, by using mathematical
induction the proof may be completed. ■

Aside from the calculation of the non-trivial elements of $P_k$, the
calculation of the non-zero elements of $B_k$ in (10) involves approximately
$4(n - i)$ multiplications. Now, in order to reduce to zero the elements in
row $i - 1$, this process must be carried out for $j = i + 1, i + 2, \ldots, n$.
That is, $4(n - i)^2$ multiplications are required to put zeros in all of the
not-codiagonal elements in row $i - 1$. Therefore, for the complete reduction
to tridiagonal form, we have the

COROLLARY. The Givens method requires

    $$\tfrac{4}{3} n^3 + \mathcal{O}(n^2) \ \text{operations}$$

to transform the real symmetric matrix $A$ to tridiagonal form. ■
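
A minimal Python sketch of the reduction in Theorem 2 follows. It is an
illustration only: the function name `givens_tridiagonalize` is hypothetical,
indices are 0-based (so the text's row $i - 1$ is the variable `p`), and whole
rows and columns are rotated without exploiting symmetry or the known zeros,
so the operation count is higher than in the corollary.

```python
import math

def givens_tridiagonalize(A):
    """Reduce a real symmetric matrix to tridiagonal form by plane rotations.
    For each row p, the rotation in the (p+1, j) plane annihilates B[p][j]."""
    n = len(A)
    B = [row[:] for row in A]
    for p in range(n - 2):
        i = p + 1
        for j in range(p + 2, n):
            if B[p][j] == 0.0:
                continue                      # P_k = I
            r = math.hypot(B[p][i], B[p][j])
            c, s = B[p][i] / r, B[p][j] / r   # p_ii = p_jj = c, p_ji = -p_ij = s
            for k in range(n):                # B <- B P (columns i, j)
                bi, bj = B[k][i], B[k][j]
                B[k][i] = c * bi + s * bj
                B[k][j] = -s * bi + c * bj
            for k in range(n):                # B <- P^T B (rows i, j)
                bi, bj = B[i][k], B[j][k]
                B[i][k] = c * bi + s * bj
                B[j][k] = -s * bi + c * bj
    return B
```

Because each rotation is an orthogonal similarity, the trace and the sum of
squares of all elements are unchanged, which gives a simple numerical check.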

We shall complete the description of Givens' method for determining
the eigenvalues and eigenvectors after we study Householder's method for
reducing the matrix $A$ to tridiagonal form. Householder's scheme uses a
sequence of $n - 2$ orthogonal similarity transformations of the form

(13)    $$P \equiv I - 2\omega\omega^*; \qquad \omega^*\omega = 1,$$

with suitably chosen vectors $\omega$. In Problem 3, $P^*P = PP^* = I$ is
verified.

We now describe how the matrices $P_k$ are defined. Let the rows
$i = 1, 2, \ldots, k - 1 \le n - 3$ of the symmetric matrix $B_{k-1}$ have the
reduced form

    $$b_{rs} = 0 \quad \text{for } 1 \le r \le k - 1 \text{ and } r + 2 \le s.$$

If

    $$\sum_{t=2}^{n-k} b_{k,k+t}^2 = 0,$$

set

    $$P_k \equiv I;$$

if

    $$\sum_{t=2}^{n-k} b_{k,k+t}^2 \ne 0,$$

set

(14)$_k$    $$P_k \equiv I - 2\omega\omega^*,$$

with

    $$\omega = \beta v, \qquad v \equiv (v_i),$$

where

    $$v_i = 0, \quad i = 1, 2, \ldots, k;$$
    $$v_{k+1} = b_{k,k+1} + S;$$
    $$v_i = b_{ki}, \quad i = k + 2, k + 3, \ldots, n;$$
    $$S^2 = \sum_{t=1}^{n-k} b_{k,k+t}^2, \qquad \beta = \frac{1}{2K}, \qquad 2K^2 = S^2 + b_{k,k+1} S,$$

and, if $b_{k,k+1} \ne 0$,

    $$S = \operatorname{sign}(b_{k,k+1}) \sqrt{S^2}$$

(when $b_{k,k+1} = 0$ either sign of the square root may be taken).

We then have 

THEOREM 3 (HOUSEHOLDER). Let $A$ be real symmetric. Set $B_0 \equiv A$, define

(15)$_k$    $$B_k = P_k^* B_{k-1} P_k, \quad k = 1, 2, \ldots, n - 2,$$

by means of (14)$_k$. Then $\omega^*\omega = 1$; $B_k$ is real symmetric; all
of the not-codiagonal elements of $B_k$, in the rows $i = 1, 2, \ldots, k$,
are zero; $B_{n-2}$ is tridiagonal.

Proof We leave the verification to Problem 4. ■ 

Now, we note that the practical evaluation of $B_k$ in (15)$_k$ can be
carried out in the following fashion:

    $$B_k = P_k^* B_{k-1} P_k$$
    $$\quad = (I - 2\omega\omega^*) B_{k-1} (I - 2\omega\omega^*)$$
    $$\quad = B_{k-1} - 2\omega\omega^* B_{k-1} - 2 B_{k-1} \omega\omega^* + 4\omega\omega^* B_{k-1} \omega\omega^*.$$

Therefore,

(16)$_k$    $$B_k = B_{k-1} - 2\beta^2 (v u^* + u v^*),$$

where

    $$u \equiv \xi - \alpha v,$$

with

    $$\xi \equiv B_{k-1} v, \qquad \alpha \equiv \beta^2 v^* \xi.$$


Observe that in (16)$_k$, $K$ need not be evaluated, and therefore only one
square root operation is required in each stage of Householder's method.
Furthermore, given $v$, the evaluation of $\xi$ requires

    $(n - k)(n - k - 1)$ multiplications;

of $\alpha$ requires

    $n - k$ multiplications;

of $uv^*$ requires

    $(n - k)^2$ multiplications.

Hence we have shown that the number of multiplications required to
produce $B_k$ is approximately $2(n - k)^2$.

COROLLARY. Householder's method reduces a real symmetric matrix to
tridiagonal form with the use of $\tfrac{2}{3} n^3 + \mathcal{O}(n^2)$
multiplications.

Proof. The result follows from the formula

    $$\sum_{k=1}^{n} 2k^2 = \frac{2n^3}{3} + \mathcal{O}(n^2). \quad ■$$
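
The stage (14)$_k$ together with the rank-two update (16)$_k$ can be sketched
as follows. This is a hedged illustration, not the authors' code: the name
`householder_tridiagonalize` is hypothetical, indices are 0-based, and full
vectors and a full $n \times n$ update are used for clarity rather than the
reduced ranges that give the $2(n - k)^2$ count. Note that only $\beta^2$ is
needed, so $K$ itself is never formed.

```python
import math

def householder_tridiagonalize(A):
    """Reduce a real symmetric matrix to tridiagonal form using
    B_k = B_{k-1} - 2 beta^2 (v u^T + u v^T), u = xi - alpha v."""
    n = len(A)
    B = [row[:] for row in A]
    for k in range(n - 2):
        b = B[k][k + 1]
        tail = sum(B[k][t] ** 2 for t in range(k + 2, n))
        if tail == 0.0:
            continue                                   # P_k = I
        S = math.copysign(math.sqrt(b * b + tail), b)  # S = sign(b) sqrt(S^2)
        v = [0.0] * n
        v[k + 1] = b + S
        for t in range(k + 2, n):
            v[t] = B[k][t]
        beta2 = 1.0 / (2.0 * (S * S + b * S))          # beta^2 = 1 / (4 K^2)
        xi = [sum(B[r][t] * v[t] for t in range(k + 1, n)) for r in range(n)]
        alpha = beta2 * sum(v[t] * xi[t] for t in range(k + 1, n))
        u = [xi[r] - alpha * v[r] for r in range(n)]
        for r in range(n):
            for s in range(n):
                B[r][s] -= 2.0 * beta2 * (v[r] * u[s] + u[r] * v[s])
    return B
```

With this sign convention the new codiagonal element in row $k$ is $-S$, so
no cancellation occurs in forming $v_{k+1} = b_{k,k+1} + S$.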

We now remark that both Givens' method and Householder's method
can be employed to reduce any real matrix $A$ to lower Hessenberg form.
The matrix $B \equiv (b_{ij})$ is in lower Hessenberg form iff $b_{is} = 0$ for
$i + 2 \le s \le n$.

Finally, we give the treatment of Givens for finding the eigenvalues of
the symmetric tridiagonal matrix $B$.

Let $B \equiv (b_{ij})$ be real symmetric and tridiagonal, i.e.,

    $$b_{ij} = b_{ji}, \quad 1 \le i, j \le n;$$
    $$b_{ij} = 0, \quad j \ne i - 1, i, i + 1; \quad 1 \le i \le n.$$

Set

(17)    $$a_i \equiv b_{ii}, \quad 1 \le i \le n;$$
        $$c_i \equiv b_{i,i+1} = b_{i+1,i}, \quad 1 \le i \le n - 1.$$

Recall that

    $$p_B(\lambda) \equiv \det(\lambda I - B).$$

Givens observed that the principal minor determinants of $\lambda I - B$
form a sequence of polynomials having properties similar to those of a
Sturm sequence (see Chapter 3, Subsection 4.2). That is, let

(18)    $$p_0(\lambda) \equiv 1,$$
        $$p_1(\lambda) \equiv (\lambda - a_1),$$
        $$p_2(\lambda) \equiv (\lambda - a_2)(\lambda - a_1) - c_1^2,$$
        $$p_i(\lambda) \equiv (\lambda - a_i) p_{i-1}(\lambda) - (c_{i-1})^2 p_{i-2}(\lambda); \quad i = 3, 4, \ldots, n.$$

In Problem 5 it is shown that $p_n(\lambda) \equiv p_B(\lambda)$. If any
$c_i = 0$, then the determination of the eigenvalues of $B$ is reduced to the
eigenvalue problem for a smaller matrix. Hence we assume $c_i \ne 0$,
$1 \le i \le n - 1$. We have then,

THEOREM 4 (GIVENS). Let the tridiagonal, real symmetric matrix $B$ be
defined by (17), with all $c_i \ne 0$. Then the zeros of each $p_i(\lambda)$,
$i = 2, 3, \ldots, n$, are distinct and are separated by the zeros of
$p_{i-1}(\lambda)$; and, if $p_n(\gamma) \ne 0$, the number of eigenvalues of
$B$ that are greater than $\gamma$ is equal to the number of sign variations
in the sequence $p_n(\gamma), p_{n-1}(\gamma), \ldots, p_1(\gamma), 1$.

Proof. Since $c_i \ne 0$, no two successive polynomials $p_i(\lambda)$ and
$p_{i-1}(\lambda)$ can have a common zero. Otherwise, from (18),
$p_{i-2}(\lambda), \ldots, p_0(\lambda)$ would also have that zero (which is
impossible, since $p_0(\lambda) \equiv 1$). By mathematical induction, we can
now prove the separation property. That is, a simple plot of $p_2(\lambda)$
shows that the simple zero of $p_1(\lambda)$ separates the two simple zeros of
$p_2(\lambda)$. Assume that the $i - 2$ simple zeros of $p_{i-2}(\lambda)$
separate the $i - 1$ simple zeros of $p_{i-1}(\lambda)$. Now, from (18), at
each zero of $p_{i-1}(\lambda)$, the sign of $p_i(\lambda)$ is opposite to the
sign of $p_{i-2}(\lambda)$. But, by the induction hypothesis,
$p_{i-2}(\lambda)$ changes sign between each pair of neighboring zeros of
$p_{i-1}(\lambda)$. Therefore, $p_i(\lambda)$ also changes sign and hence has
a zero between each neighboring pair of zeros of $p_{i-1}(\lambda)$. Now

    $$p_i(+\infty) = +\infty, \qquad p_i(-\infty) = (-1)^i \infty, \qquad i = 1, 2, \ldots.$$

Therefore $p_i(\lambda)$ has a zero to the right of the largest zero of
$p_{i-1}(\lambda)$ and a zero to the left of the smallest zero of
$p_{i-1}(\lambda)$. On the other hand, $p_i(\lambda)$ can have no more than
$i$ zeros. Therefore, we have shown that the $i - 1$ simple zeros of
$p_{i-1}(\lambda)$ separate the $i$ simple zeros of $p_i(\lambda)$. This
separation property is all that is needed to verify the rest of the theorem's
conclusion, by the kind of argument we used for treating Sturm sequences in
Chapter 3, Subsection 4.2 (see Problem 6). ■

The evaluation of the characteristic polynomial of $B$, once we have
calculated $\{c_i^2\}$, is then carried out by using (18). Thus a sequence of
$2n - 3$ multiplications is required for each determination of
$p_n(\lambda)$. If all of the eigenvalues of $B$ are in the interval $[a, b]$,
then we may locate the eigenvalues more precisely by halving the interval and
using the theorem to find out how many zeros of $p_B(\lambda)$ are in each
half, i.e., pick $\gamma = (a + b)/2$. In this way, after $t$ halvings, we
will know the location of an eigenvalue to within $\pm(b - a)2^{-(t+1)}$, at
the expense of about $2tn$ multiplications.
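
The counting rule of Theorem 4 and the halving procedure can be sketched as
below. The names `count_greater` and `kth_largest` are hypothetical, the
lists `a` and `c` hold the $a_i$ and $c_i$ of (17) with 0-based indices, and
exact zeros in the sequence are simply skipped when counting sign variations
(an assumption of this sketch; the theorem itself is stated for
$p_n(\gamma) \ne 0$). The unscaled recurrence (18) can overflow for large
$n$, which the sketch does not guard against.

```python
def count_greater(a, c, gamma):
    """Number of sign variations in p_n(gamma), ..., p_1(gamma), 1,
    i.e. the number of eigenvalues of B that exceed gamma."""
    seq = [1.0, gamma - a[0]]                       # p_0, p_1
    for i in range(1, len(a)):
        seq.append((gamma - a[i]) * seq[-1] - c[i - 1] ** 2 * seq[-2])
    signs = [p > 0.0 for p in seq if p != 0.0]      # drop exact zeros
    return sum(1 for u, w in zip(signs, signs[1:]) if u != w)

def kth_largest(a, c, k, lo, hi, t=60):
    """Locate the k-th largest eigenvalue in [lo, hi] by t halvings."""
    for _ in range(t):
        gamma = 0.5 * (lo + hi)
        if count_greater(a, c, gamma) >= k:
            lo = gamma                              # eigenvalue lies above gamma
        else:
            hi = gamma
    return 0.5 * (lo + hi)
```

For example, with `a = [2.0, 2.0, 2.0]` and `c = [1.0, 1.0]` every eigenvalue
lies in $[0, 4]$ by Gerschgorin's theorem, so `kth_largest(a, c, 1, 0.0, 4.0)`
brackets the largest one.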

Once an eigenvalue $\lambda$ of $B$ is found, a corresponding eigenvector can
be determined by using the fact that

    $$B - \lambda I$$

has rank $\le n - 1$.

If none of the off-diagonal terms $c_i$ vanish,† then the equations

    $$(B - \lambda I)x = o$$

may be solved simply by applying Gaussian elimination. That is, with the
use of maximal column pivots, we proceed to eliminate $x_1, x_2, \ldots, x_n$.
We neglect the last $r$ equations of the essentially triangular system that
results, if the rank is $n - r$. We give arbitrary non-zero values to the
variables $x_{n-r+1}, \ldots, x_n$, that appear in the last $r$ equations; and
solve the first $n - r$ equations of the triangular system. If the maximal
column pivots do not occur in order along the diagonal, we list the pivotal
equations in the order that we find them. In this case, the then resulting
upper triangular matrix, $U$, may not be codiagonal. That is, the only
non-trivial elements in row $i$ may be the elements $u_{ii}$, $u_{i,i+1}$,
and $u_{i,i+2}$.

In Problem 8 we see that $\max_{i,j} |u_{ij}| \le 5b$ where
$b \equiv \max_{i,j} |b_{ij}|$. Hence by the theory of a priori estimates in
Subsection 1.2, Chapter 2, though the matrix $B - \lambda I$ is singular, the
first $n - r$ equations of $U$, when computed with finite precision, give an
accurate representation of the coefficients of the unknowns
$x_1, x_2, \ldots, x_{n-r}$ that would arise by exact elimination.
Furthermore, the precise determination of $r$ is not necessary!

If $x$ is an eigenvector of $B$ corresponding to the eigenvalue $\lambda$ then

    $$y = Px,$$

is the corresponding eigenvector of $A$, since

    $$B = P^{-1} A P.$$


For determining all of the eigenvalues and all of the eigenvectors of B , 
experience has shown that the QR method (see end of section) for the 
symmetric, tridiagonal matrix B is a more efficient procedure than Givens’ 
method. 


† If any $c_i = 0$, the system splits into two disjoint systems.

We now turn our attention to the case of the general real matrix $A$.
If we are interested in determining all of the eigenvalues of $A$, then a
preliminary simplification to a similar Hessenberg form $B$ is appropriate,
since the iterative operations on $B$ will then require fewer calculations.
For example, as remarked after the corollary to Theorem 3, Givens'
or Householder's orthogonal similarity transformations might be used to
effect the reduction.

Review the discussion, between Theorem 3 and its corollary, of the
number of operations involved in using Householder's method. Since the
matrix $B_{k-1}$ is not symmetric when $A$ is not symmetric, we see that
the operational count for the vector $\xi$ becomes
$(n - k)n + \mathcal{O}(n)$ while the other counts remain the same. Hence,
the number of multiplications required to reduce the general real matrix $A$
to Hessenberg form is $\tfrac{5}{3} n^3 + \mathcal{O}(n^2)$ since (16)$_k$ is
not valid.

A convenient and practical technique to evaluate $p_B(\lambda)$, when
$B \equiv (b_{ij})$ is in lower Hessenberg form, uses

LEMMA 3 (HYMAN). Let $B$ be in lower Hessenberg form. If $b_{i,i+1} \ne 0$,
$i = 1, 2, \ldots, n - 1$, define the sequence of polynomials $m_i(\lambda)$:

(19)    $$m_0 \equiv 1,$$
        $$-b_{i,i+1} m_i = b_{i1} m_0 + b_{i2} m_1 + \cdots + b_{i,i-1} m_{i-2} + (b_{ii} - \lambda) m_{i-1}, \quad i = 1, 2, \ldots, n - 1.$$

Then

(20)    $$(-1)^n p_B(\lambda) = \det(B - \lambda I) = (-1)^{n+1} b_{12} b_{23} \cdots b_{n-1,n}\, g(\lambda),$$
        $$g(\lambda) \equiv b_{n1} m_0 + b_{n2} m_1 + \cdots + b_{n,n-1} m_{n-2} + (b_{nn} - \lambda) m_{n-1}.$$

Proof. Since $b_{i,i+1} \ne 0$, we can successively add multiples of the
columns of $B - \lambda I$ to the first column in order to annihilate the
first $n - 1$ elements of the first column. This process defines the
polynomials $\{m_i\}$, $i = 1, 2, \ldots, n - 1$, and does not change the
value of the determinant. But the $(n, 1)$ element is $g(\lambda)$ and the
expansion of the determinant with respect to the elements of the first column
results in (20). ■

Clearly, if any $b_{i,i+1} = 0$, then $\det(B - \lambda I)$ can be written as
the product of two determinants of submatrices of $B - \lambda I$.

In the case $b_{i,i+1} \ne 0$, formulas (19) and (20) may be used to calculate
$g(\lambda)$, with the use of $n^2/2 + \mathcal{O}(n)$ multiplications.
Similarly, by differentiating the formulas (19) with respect to $\lambda$, a
recursive evaluation of $g'(\lambda)$ (or higher derivatives) is also simply
carried out. Hence we may apply any of the standard iterative methods for
finding the roots of the polynomial $g(\lambda)$ (without evaluating its
coefficients). A matter of considerable practical importance for the
evaluation of the polynomial $g(\lambda)$ is the double precision accumulation
of the inner products in (19) and (20).
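
The recursion (19) and formula (20) translate directly into code. The sketch
below is illustrative only: the name `hyman` is hypothetical, indices are
0-based, and ordinary floating-point sums are used instead of the double
precision accumulation recommended above.

```python
def hyman(B, lam):
    """Evaluate g(lam) of (20) and det(B - lam*I) for a lower Hessenberg
    matrix B whose superdiagonal elements B[i][i+1] are all non-zero."""
    n = len(B)
    m = [1.0]                                          # m_0 = 1
    for i in range(n - 1):                             # rows 1, ..., n-1 of the text
        acc = sum(B[i][j] * m[j] for j in range(i)) + (B[i][i] - lam) * m[i]
        m.append(-acc / B[i][i + 1])                   # solve (19) for the next m
    g = sum(B[n - 1][j] * m[j] for j in range(n - 1)) + (B[n - 1][n - 1] - lam) * m[n - 1]
    scale = (-1.0) ** (n + 1)
    for i in range(n - 1):
        scale *= B[i][i + 1]                           # b_12 b_23 ... b_{n-1,n}
    return g, scale * g
```

Since the scale factor does not depend on $\lambda$, a root finder needs only
the first returned value.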

Another family of methods for finding the eigenvalues of the lower
Hessenberg matrix $A$ are called factorization methods. First, we note that
$C \equiv A^T$ and $A$ have the same eigenvalues, but that $C$ is of upper
Hessenberg form. The first factorization method we describe is Rutishauser's
LR method. That is, the LR method consists in constructing (when
possible) the factorization of the matrix $C_1 \equiv C$ in the form

(21)    $$C_1 = L_1 R_1,$$

where $L_1$ is lower triangular (with unit diagonal elements) and $R_1$ is
upper triangular. Then Rutishauser considers

(22)    $$L_1^{-1} C_1 L_1 = R_1 L_1 \equiv C_2.$$

Next find

(23)    $$C_2 = L_2 R_2,$$

again a lower unit and upper triangular factorization.

Now

    $$L_2^{-1} C_2 L_2 = L_2^{-1} L_1^{-1} C_1 L_1 L_2 = R_2 L_2 \equiv C_3.$$

In general, a sequence of similar matrices $\{C_k\}$ is constructed and their
$L_k R_k$ factorization via Gaussian elimination is also found, so that

(24)    $$C_k = L_k R_k,$$
        $$C_{k+1} \equiv R_k L_k = L_k^{-1} C_k L_k = L_k^{-1} \cdots L_1^{-1} C_1 L_1 \cdots L_k.$$

But if we define

    $$P_k \equiv L_1 L_2 \cdots L_k, \qquad Q_k \equiv R_k \cdots R_1,$$

then

    $$P_k C_{k+1} = C_1 P_k,$$

and therefore

(25)    $$P_k Q_k = P_{k-1} C_k Q_{k-1} = C_1 P_{k-1} Q_{k-1}$$
        $$\quad = C_1^2 P_{k-2} Q_{k-2}$$
        $$\quad = \cdots = C_1^k.$$

Hence, $P_k Q_k$ is the LR factorization of $C_1^k$. The fact that the matrix
$C_k$ converges to an upper triangular form is shown under special assumptions
(which may be weakened considerably).

THEOREM 5 (RUTISHAUSER). Let $C_1 X = X\Lambda$, where
$\Lambda \equiv (\lambda_i \delta_{ij})$. Assume
$|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n| > 0$. Set

    $$Y \equiv X^{-1}.$$

Assume $X$ and $Y$ can be factored in the form

    $$X = L_X R_X, \qquad Y = L_Y R_Y,$$

where $L_X$ and $L_Y$ are lower unit triangular and $R_X$ and $R_Y$ are upper
triangular. Then with $\{C_k\}$ defined by (21), we have the result: $C_k$
converges to upper triangular form.

Proof (Wilkinson). Clearly,

    $$C_1^k = X \Lambda^k Y = X \Lambda^k L_Y R_Y$$
    $$\quad = X (\Lambda^k L_Y \Lambda^{-k})(\Lambda^k R_Y).$$

But by the strict inequalities satisfied by $\{\lambda_i\}$, the lower
triangular matrix defined by

    $$\Lambda^k L_Y \Lambda^{-k} \equiv I + E_k, \qquad E_k \equiv (e_{ij}^{(k)}),$$

satisfies

    $$E_k \to O; \qquad e_{ii}^{(k)} = 0, \quad i = 1, 2, \ldots, n.$$

Therefore

    $$C_1^k = L_X R_X (I + E_k) \Lambda^k R_Y$$
    $$\quad = L_X (I + R_X E_k R_X^{-1}) R_X \Lambda^k R_Y.$$

But $R_X E_k R_X^{-1} \to O$. Therefore, the LR factors of
$I + R_X E_k R_X^{-1}$ both converge to $I$.

Hence, since $R_X \Lambda^k R_Y$ is upper triangular, we see that the lower
triangular factor of $C_1^k$, which by (25) is $P_k$, converges to $L_X$.
That is,

    $$P_k \to L_X.$$

Hence $L_k = P_{k-1}^{-1} P_k$ converges to $I$.

But then, since

    $$L_k^{-1} C_k = R_k,$$

is upper triangular, it follows that $C_k$ must converge to upper triangular
form. ■

It is easy to verify that the LR method preserves the Hessenberg form
of the matrices $\{C_k\}$. The LR method can be made to converge much more
rapidly by introducing a shift

    $$D_k \equiv C_k - s_k I$$

and then continuing with the factorization of $D_k$.
But we shall not pursue this avenue further. We merely remark that the
LR method does not always work as well as we have described, because
the LR factorization, even if possible, may give rise to ever increasing
magnitudes of numbers.
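
The unshifted iteration (24) can be sketched as follows. This is a minimal
illustration with hypothetical names (`lu_nopivot`, `matmul`, `lr_iterate`);
it uses Gaussian elimination without pivoting, so it raises
`ZeroDivisionError` when the factorization is not possible, and it makes no
use of the Hessenberg structure or of shifts.

```python
def lu_nopivot(C):
    """Factor C = L R with L unit lower triangular and R upper triangular."""
    n = len(C)
    L = [[float(r == s) for s in range(n)] for r in range(n)]
    R = [row[:] for row in C]
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i][k] = R[i][k] / R[k][k]        # multiplier; fails on a zero pivot
            for j in range(k, n):
                R[i][j] -= L[i][k] * R[k][j]
    return L, R

def matmul(X, Y):
    """Plain matrix product of two square lists of lists."""
    n = len(X)
    return [[sum(X[r][t] * Y[t][s] for t in range(n)) for s in range(n)]
            for r in range(n)]

def lr_iterate(C, steps):
    """Form C_{k+1} = R_k L_k repeatedly, starting from C_1 = C."""
    for _ in range(steps):
        L, R = lu_nopivot(C)
        C = matmul(R, L)
    return C
```

For a matrix satisfying the hypotheses of Theorem 5 the subdiagonal elements
of the iterates decay and the diagonal approaches the eigenvalues in order of
decreasing modulus.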

Another factorization method which seems not to suffer from the above
defect is the QR factorization of Francis and Kublanovskaja. That is, the
upper Hessenberg matrix $C_1$ is written

    $$C_1 = Q_1 R_1,$$

whence

    $$Q_1^{-1} C_1 Q_1 = R_1 Q_1 \equiv C_2,$$

where $R_1$ is upper triangular and $Q_1$ is unitary, i.e.,

    $$Q_1^* = Q_1^{-1}.$$

We then factor

    $$C_2 = Q_2 R_2.$$

In general, with

    $$C_k = Q_k R_k,$$

we set

    $$Q_k^{-1} C_k Q_k = C_{k+1} \equiv R_k Q_k.$$

All the matrices $Q_k$ are unitary and all the matrices $R_k$ are upper
triangular. Again, the Hessenberg form of $\{C_k\}$ is preserved.

Francis and Kublanovskaja have given proofs of convergence of $C_k$
to upper triangular form, in special cases. Wilkinson has given a simpler
proof using techniques similar to those used in proving Theorem 5.

An important feature of Francis' work is that he shows how to use real
arithmetic and maintain the real Hessenberg form, even when some
eigenvalues are complex and the accelerating shifts indicated above (for the
LR method) are complex numbers. Since he works with real numbers, when
the eigenvalues are distinct but some are complex conjugate, in pairs

    $$\lambda_{i+1} = \bar{\lambda}_i,$$

then the matrices $C_k$ converge to a form that is not triangular (i.e., the
limiting form has second order matrix blocks on the diagonal).†

† These real matrices $C_k$ are real orthogonally similar and therefore can
only be expected to converge to the Murnaghan-Wintner canonical form rather
than the Schur form of Theorem 2.1.

The QR factorization is unique when $C_1$ is non-singular. The Gram-Schmidt
orthogonalization process is not recommended for carrying out
the QR factorization. Rather, left-multiplications may be performed
upon the matrix $C_1$, by unitary matrices of the form
$I - 2\omega\omega^*$, in order to successively reduce the columns of $C_1$.
That is,

    $$(I - 2\omega_{n-1}\omega_{n-1}^*) \cdots (I - 2\omega_1\omega_1^*) C_1 = R_1.$$

The matrix $Q_1$ is given by the product of the unitary matrices,

    $$Q_1 = (I - 2\omega_1\omega_1^*)(I - 2\omega_2\omega_2^*) \cdots (I - 2\omega_{n-1}\omega_{n-1}^*).$$

When $C_1$ is in Hessenberg form, these unitary matrices are simple two
dimensional rotations.
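
For an upper Hessenberg matrix, one unshifted QR step can be sketched with
$n - 1$ plane rotations. This is an illustration under stated assumptions:
the names `qr_step` and `qr_iterate` are hypothetical, real arithmetic is
assumed, rotations are used in place of the factors $I - 2\omega\omega^*$,
and neither shifts nor the real double-step technique of Francis are
included.

```python
import math

def qr_step(C):
    """Return R Q, where C = Q R and C is upper Hessenberg.
    R is formed by rotating rows (k, k+1); R Q applies the transposed
    rotations to columns (k, k+1)."""
    n = len(C)
    R = [row[:] for row in C]
    rotations = []
    for k in range(n - 1):
        a, b = R[k][k], R[k + 1][k]
        r = math.hypot(a, b)
        c, s = (1.0, 0.0) if r == 0.0 else (a / r, b / r)
        rotations.append((c, s))
        for j in range(k, n):                 # annihilate R[k+1][k]
            x, y = R[k][j], R[k + 1][j]
            R[k][j] = c * x + s * y
            R[k + 1][j] = -s * x + c * y
    for k, (c, s) in enumerate(rotations):    # multiply on the right by Q
        for i in range(k + 2):
            x, y = R[i][k], R[i][k + 1]
            R[i][k] = c * x + s * y
            R[i][k + 1] = -s * x + c * y
    return R

def qr_iterate(C, steps):
    """Apply `steps` unshifted QR steps starting from C_1 = C."""
    for _ in range(steps):
        C = qr_step(C)
    return C
```

Each step is an orthogonal similarity, so the trace is preserved and a
symmetric tridiagonal input stays symmetric up to rounding.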


PROBLEMS, SECTION 3 

1. With A ih R , and B {j as defined in Lemma 2, show that b tj = b n = 0. 

2. Give the details of the proof of Theorem 2. 

[Hint: The only elements of $B_{k-1}$ that are transformed by $P_k$ are in the
rows and columns with indices $i$ or $j$. It is necessary only to examine the
components (where $j > i \ge 2$)

    $$b_{si}^{(k)}, \quad \text{if } s \le i - 2;$$

and

    $$b_{sj}^{(k)}, \quad \text{if } 1 \le s \le i - 1.]$$

3. Verify that for any matrix $P$ of the form (13), $P^* = P$ and

    $$P^*P = PP^* = I.$$

4. Carry out the verification of Theorem 3.

[Hint: Rows $i = 1, 2, \ldots, k - 1$ are unaffected by (15)$_k$. Check that
row $k$ is properly reduced.]

5. Verify that $p_n(\lambda)$ given by (18) satisfies
$p_n(\lambda) \equiv p_B(\lambda)$.

6. Complete the proof of Theorem 4. That is, use the hypothesis and assume
the root separation property to prove the recipe for counting eigenvalues.

7. What recurrence relation is satisfied by the functions
$\{p_i'(\lambda)\}$, where $\{p_i(\lambda)\}$ is defined in (18)?

8. Let $B$ be tridiagonal, of form (17); $c_i \ne 0$,
$i = 1, 2, \ldots, n - 1$; and $\lambda$ be an eigenvalue of $B$. Use Gaussian
elimination with maximal column pivots to solve $(B - \lambda I)x = o$. Let
$U \equiv (u_{ij})$ be the matrix of the resulting equivalent upper triangular
system. If $\max_i (|a_i|, |c_i|) = b$, then $\max_{i,j} |u_{ij}| \le 5b$
(Wilkinson).

[Hint: By Gerschgorin's theorem $|\lambda| \le 3b$. Use induction and examine
the two equations involved in eliminating $x_i$.]

9. Let $A$ be Hermitian of order $n$ and $\mu$ be an approximation to the
simple eigenvalue $\lambda$. Let the eigenvector $x$ correspond to $\lambda$.
For simplicity, suppose $\max_i |x_i| = x_n = 1$. The usual way to approximate
$x$ consists in solving the $n - 1$ equations

(26)    $$(\hat{A} - \mu \hat{I})\hat{x} = -\hat{c},$$

where $\hat{A}$, $\hat{I}$, $\hat{x}$, $\hat{c}$ consist in deleting the last
row and column of $A$ and $I$ and deleting the last component both of $x$ and
of the last column $c$ of $A$, respectively. Then define the residual

(27)    $$f(\mu) \equiv \hat{c}^* \hat{x} + a_{nn} - \mu.$$

(a) From (26), since $\hat{x}$ is a function of $\mu$, verify by
differentiation with respect to $\mu$ that

    $$(\hat{A} - \mu \hat{I}) \frac{d\hat{x}}{d\mu} = \hat{x}.$$

If $\mu$ is close enough to $\lambda$, and $\hat{A} - \lambda \hat{I}$ is
non-singular, then

    $$\frac{d\hat{x}}{d\mu} = (\hat{A} - \mu \hat{I})^{-1} \hat{x}.$$

From (27),

    $$\frac{df}{d\mu} = \hat{c}^* (\hat{A} - \mu \hat{I})^{-1} \hat{x} - 1.$$

(b) Verify, then, that in Newton's method for improving $\mu$

    $$-\frac{f(\mu)}{f'(\mu)} = \frac{x^* \eta}{x^* x},$$

where $\eta_i = 0$ for $1 \le i \le n - 1$; $\eta_n = f(\mu)$.

10. Define the circulant matrix $A \equiv (a_{ij})$, of order $n$, generated
from $(a_1, a_2, \ldots, a_n)$ by

    $$a_{ij} \equiv \begin{cases} a_{j-i+1}, & i \le j, \\ a_{n+j-i+1}, & i > j. \end{cases}$$

Show that the eigenvectors $\{u_j\}$ and eigenvalues $\{\lambda_j\}$ are given
in terms of the $n$ roots of unity $\{\omega_j\}$ [i.e., $(\omega_j)^n = 1$] by

    $$u_j = (1, \omega_j, \omega_j^2, \ldots, \omega_j^{n-1}),$$
    $$\lambda_j = a_1 + a_2 \omega_j + a_3 \omega_j^2 + \cdots + a_n \omega_j^{n-1}.$$


11. For the circulant matrix generated by $a_i = 1$, $1 \le i \le 6$, use
Householder's method and Jacobi's method to find the eigenvalues and
eigenvectors.

12. Show that Householder's method reduces a real skew-symmetric matrix
to a real tridiagonal skew-symmetric matrix. Carry out this procedure for
the circulant matrix generated by
$(a_1, a_2, a_3, a_4, a_5, a_6) = (0, 1, 1, 0, -1, -1)$.

13. If $C_1$ is of Hessenberg form, then show that each stage of the LR
transformation requires $n^2 + \mathcal{O}(n)$ operations, while each stage of
the QR transformation requires $4n^2 + \mathcal{O}(n)$ operations.




5 


Basic Theory of Polynomial 
Approximation 


0. INTRODUCTION 

There are numerous reasons for seeking approximations to functions.
The type of approximation sought depends upon the application intended
as well as the ease or difficulty with which it can be obtained. In any
event, the “simplest”† approximating functions would seem to be
polynomials, and we devote much of our attention to them in this chapter.
Some consideration is also given to approximation by trigonometric
functions. We shall study the approximation of continuous (possibly
differentiable) functions, in a closed bounded interval.

† We do not study the theory of approximation by rational functions (i.e.,
quotients of polynomials), even though rational functions are easy to evaluate
and, in certain cases, are more efficient than polynomials.

In general, a polynomial, say $P_n(x)$ of degree at most $n$, may be said
to be an approximation to a function, $f(x)$, in an interval $a \le x \le b$
if some measure of the deviation of the polynomial from the function in this
interval is “small.” This notion of approximation becomes precise only
when the measure of deviation and magnitude of smallness have been
specified.

To this end, we recapitulate the definition of norm, this time for a linear
space (not necessarily finite dimensional) whose elements are functions
$\{f(x)\}$ (see Chapter 1, Section 1). The norm, written

    $$\mathrm{Norm}(f) \equiv N(f) \equiv \|f\|,$$

is an assignment of a real number to each element of the linear space such
that:

(i) $\|f\| \ge 0$,

(ii) $\|f\| = 0$ iff $f(x) \equiv 0$,

(iii) $\|cf\| = |c| \cdot \|f\|$ for any constant $c$,

(iv) $\|f + g\| \le \|f\| + \|g\|$, the triangle inequality.

The notion of linear spaces of functions is of basic importance in the
analysis of many approximation procedures. In particular, for present
applications we wish to examine not the polynomial approximation of a
particular function but, in fact, the properties of such approximations for
any function in an appropriate linear space.

A measure of the deviation or error in the approximation of $f(x)$ by
$P_n(x)$ will be denoted generically by

    $$|f(x) - P_n(x)|$$

(sometimes with an appropriate subscript or superscript). This measure
of deviation will be required to satisfy the properties (i), (iii), and (iv)
of a norm, but not necessarily property (ii), because $|f| = 0$ may not imply
$f(x) \equiv 0$. Such a measure is actually called a semi-norm. If we were to
introduce here equivalence classes of functions, i.e., identify $f(x)$ and
$g(x)$ if $|f - g| = 0$, then the measure $|\cdot|$ becomes a norm in a
natural way in the linear space composed of these equivalence classes. For
simplicity, we refer to $|\cdot|$ as a norm in this chapter, even though we do
not formally introduce the equivalence classes of functions. Once such a norm
has been defined three questions are naturally suggested:

(a) Does a polynomial exist, of a specified maximum degree, which 
minimizes the error? 

(b) If such a polynomial exists, is it unique? 

(c) If such a polynomial exists, how can it be determined? 

With the convention that, unless otherwise specified, $P_n(x)$ represents
a polynomial of degree at most $n$, it is clear that

(1)    $$\operatorname*{g.l.b.}_{P_n(x)} |f(x) - P_n(x)| \equiv d_n \ge 0,$$

is a monotonic non-increasing function of $n$. If there exists a unique
polynomial $P_n(x)$ for which the minimum error is attained, we may then
investigate methods for determining $P_n(x)$ and the magnitude of $d_n$ (or
any other norm of the deviation). In particular, we are most interested in
those approximation methods for which $d_n \to 0$ as $n \to \infty$.

An example for which questions (a), (b), and (c) are easily answered is
furnished by a well-known polynomial approximation: the first $m + 1$
terms in the Taylor expansion of $f(x)$ about $x_0$. That is, we consider the
linear space composed of functions $f(x)$ which have derivatives of order $n$
at $x = x_0$. For any given $f(x)$ in this space we define

(2)    $$P_m(x) \equiv f(x_0) + \frac{(x - x_0)}{1!} f^{(1)}(x_0) + \frac{(x - x_0)^2}{2!} f^{(2)}(x_0) + \cdots + \frac{(x - x_0)^m}{m!} f^{(m)}(x_0), \qquad m \le n.$$

This polynomial clearly minimizes the measure of error

(3)    $$|g(x)|^{(n)} \equiv \sum_{k=0}^{n} |g^{(k)}(x_0)|,$$

with $g(x) \equiv f(x) - Q_m(x)$ among all polynomials $\{Q_m\}$ of degree at
most $m$, since

    $$|f(x) - P_m(x)|^{(n)} = \begin{cases} 0, & \text{if } m = n, \\ \sum_{k=m+1}^{n} |f^{(k)}(x_0)|, & \text{if } m < n. \end{cases}$$

Thus existence and explicit construction of a best approximating
polynomial for this somewhat contrived norm† are demonstrated. Uniqueness
of $P_m(x)$ for a given $m \le n$ can be proven by assuming that there is some
other polynomial of degree at most $m$, say $Q_m(x)$, which also minimizes
the error. By expressing $Q_m(x)$ as a polynomial in $(x - x_0)$ we find that
the coefficients $Q_m^{(k)}(x_0)/k!$ must be identical with those of $P_m(x)$
given in (2), since otherwise, $|f - Q_m|^{(n)} > |f - P_m|^{(n)}$.

† Note that we may have $|g(x)|^{(n)} = 0$ but $g(x) \not\equiv 0$, e.g.,
$g(x) \equiv (x - x_0)^{n+1}$. Thus, on the indicated space, (3) is an example
of a semi-norm which is not a norm.

Now, if $f(x)$ has an $(n + 1)$st derivative in some interval about $x_0$,
say $|x - x_0| < a$, then by Taylor's theorem the remainder in the expansion
(or we may call it the pointwise error in the approximation) is given by

(4)    $$R_n(x) \equiv f(x) - P_n(x) = \frac{(x - x_0)^{n+1}}{(n + 1)!} f^{(n+1)}(\xi),$$

where $|x - x_0| < a$ and $\xi = \xi(x)$ is some point in the open interval
$(x, x_0)$. For the special function $f(x) \equiv 1/(1 + x)$ in the interval
$-\tfrac{1}{2} \le x \le 2$ and $x_0 = 0$, we find
$P_n(x) = 1 - x + \cdots + (-1)^n x^n$ and

    $$R_n(x) = \frac{(-1)^{n+1} x^{n+1}}{1 + x}.$$

In this case, although $|R_n(x)|^{(n)} = 0$, we note that for the maximum
norm

    $$\|R_n(x)\|_\infty \equiv \operatorname*{l.u.b.}_{-1/2 \le x \le 2} |R_n(x)| \ge \frac{2^{n+1}}{3},$$

and hence some caution must be used. What may be a good approximation
when measured in one norm [i.e., in $|\cdot|^{(n)}$ of (3)] may be a very poor
approximation in another norm (i.e., $\|\cdot\|_\infty$). But, in this
example, if the interval were $-\tfrac{1}{2} \le x \le \tfrac{1}{2}$, then

    $$\|R_n(x)\|_\infty \equiv \operatorname*{l.u.b.}_{-1/2 \le x \le 1/2} |R_n(x)| \le 2^{-n},$$

and we find that the Taylor series, $P_n(x)$, converges uniformly with respect
to $x$ for this function. In fact, the series converges uniformly in any
closed interval contained in the open interval $(-1, 1)$. The latter property
of Taylor series is typical and is more fully treated in the study of analytic
functions of a complex variable.
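
The contrast between the two intervals is easy to observe numerically. The
following sketch (the name `remainder_sup` and the uniform sampling grid are
assumptions made here for illustration) estimates the maximum norm of
$R_n(x)$ for $f(x) = 1/(1 + x)$ by sampling; a grid can only underestimate a
least upper bound, but both endpoints, where the extreme values occur in this
example, are grid points.

```python
def remainder_sup(n, lo, hi, samples=2001):
    """Largest sampled value of |f(x) - P_n(x)| on [lo, hi] for
    f(x) = 1/(1 + x) and P_n(x) = 1 - x + ... + (-1)^n x^n."""
    best = 0.0
    for k in range(samples):
        x = lo + (hi - lo) * k / (samples - 1)
        p_n = sum((-x) ** j for j in range(n + 1))   # Taylor polynomial about 0
        best = max(best, abs(1.0 / (1.0 + x) - p_n))
    return best
```

Comparing `remainder_sup(10, -0.5, 2.0)` with `remainder_sup(10, -0.5, 0.5)`
shows the growth like $2^{n+1}/3$ on the larger interval and the decay like
$2^{-n}$ on the smaller one.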

The questions (a), (b), and (c) can be answered in many other specific
cases and we do this for several different norms in this chapter. However,
question (a), of existence, can be given an affirmative answer quite
generally. We do this in Theorem 1. The answer to question (b), on
uniqueness, is a qualified yes given in Theorem 2. (For all the specific
approximation problems treated in this chapter the answer is yes.) A general
answer to question (c) is not known but we show how to construct the
minimizing polynomial for several norms.

For the general results to be presented we assume that the polynomials
and the functions, $f(x)$, to be approximated are in a linear space $C[a, b]$
of functions defined on the closed bounded interval, $[a, b]$. Then we have

THEOREM 1. Let the measure of deviation $|\cdot|$ be defined in $C[a, b]$, and
let there exist positive numbers $m_n$ and $M_n$ which satisfy

(5)    $$0 < m_n \le \Bigl| \sum_{j=0}^{n} b_j x^j \Bigr| \le M_n, \qquad n = 0, 1, \ldots,$$

for all $\{b_j\}$ such that

    $$\sum_{j=0}^{n} |b_j|^2 = 1.$$

Then for any integer $n$ and $f(x)$ in $C[a, b]$ there exists a polynomial of
degree at most $n$ for which

    $$|f(x) - P_n(x)|$$

attains its minimum over all such polynomials.

Proof. Write the general $n$th degree polynomial as

    $$P_n(x) = a_0 + a_1 x + \cdots + a_n x^n,$$

and consider the function of the $n + 1$ coefficients

    $$\phi(a_0, a_1, \ldots, a_n) \equiv |f(x) - P_n(x)|.$$

By the properties (iii) and (iv) of norms and the hypothesis (5), we obtain
the continuity of $\phi$, as follows:

    $$\phi(a_0 + \epsilon_0, a_1 + \epsilon_1, \ldots, a_n + \epsilon_n) = \Bigl| f(x) - P_n(x) - \sum_{j=0}^{n} \epsilon_j x^j \Bigr|$$
    $$\quad \le \phi(a_0, a_1, \ldots, a_n) + \Bigl| \sum_{j=0}^{n} \epsilon_j x^j \Bigr|$$
    $$\quad \le \phi(a_0, a_1, \ldots, a_n) + \sum_{j=0}^{n} |\epsilon_j| \cdot |x^j|$$
    $$\quad \le \phi(a_0, a_1, \ldots, a_n) + M_n \sum_{j=0}^{n} |\epsilon_j|.$$

Similarly, we find

    $$\phi(a_0, a_1, \ldots, a_n) = \Bigl| f(x) - P_n(x) - \sum_{j=0}^{n} \epsilon_j x^j + \sum_{j=0}^{n} \epsilon_j x^j \Bigr|$$
    $$\quad \le \phi(a_0 + \epsilon_0, a_1 + \epsilon_1, \ldots, a_n + \epsilon_n) + M_n \sum_{j=0}^{n} |\epsilon_j|.$$

Hence for any $\{a_j\}$ and $\{\epsilon_j\}$

(6)    $$|\phi(a_0 + \epsilon_0, a_1 + \epsilon_1, \ldots, a_n + \epsilon_n) - \phi(a_0, a_1, \ldots, a_n)| \le M_n \sum_{j=0}^{n} |\epsilon_j|.$$

This demonstrates that $\phi(a_0, a_1, \ldots, a_n)$ is a continuous function
of the coefficients $(a_0, a_1, \ldots, a_n)$. (Compare Lemma 1.1 of
Chapter 1.)

Since $\phi(a_0, a_1, \ldots, a_n) \ge 0$, the “minimum deviation” in (1) can
be characterized as:

(7)    $$\operatorname*{g.l.b.}_{(a_0, a_1, \ldots, a_n)} \phi(a_0, a_1, \ldots, a_n) \equiv d_n \ge 0.$$

Thus, the existence problem is reduced to showing that there is a set of
coefficients, say $(\hat{a}_0, \hat{a}_1, \ldots, \hat{a}_n)$, such that

    $$\phi(\hat{a}_0, \hat{a}_1, \ldots, \hat{a}_n) = d_n.$$

However, since $\phi$ is continuous, the result will follow from a theorem of
Weierstrass if we can show that $d_n$ is the g.l.b. of $\phi$ in an
appropriate closed bounded domain of the coefficients. That is, we will show
that for some $R > 0$,

(8)    $$d_n = \operatorname*{g.l.b.}_{\sum_{j=0}^{n} |a_j|^2 \le R^2} \{\phi(a_0, a_1, \ldots, a_n)\}.$$

Then since the continuous function $\phi(a_0, a_1, \ldots, a_n)$ attains its
minimum $d_n$ on the closed bounded set $\sum_{j=0}^{n} |a_j|^2 \le R^2$, the
theorem follows.

To verify (8) we observe, using (iii) and (iv), that

    $$|f - g| \le |f| + |g|,$$

and setting $f = u + g$ we get

(9)    $$|u + g| \ge |u| - |g|.$$

Therefore, for any constant $\mu > 0$, by (9) and (iii),

    $$|f - P_n(x)| \ge |P_n(x)| - |f|$$
    $$\quad \ge \mu \Bigl| \frac{P_n(x)}{\mu} \Bigr| - |f|.$$

Let us pick $\mu$ such that $\sum_{j=0}^{n} |a_j/\mu|^2 = 1$. Then by (5) in
the hypothesis of the theorem

    $$\Bigl| \frac{P_n(x)}{\mu} \Bigr| \ge m_n$$

and the previous inequality implies

    $$|f - P_n(x)| \ge \mu\, m_n - |f|.$$

So, if $\mu$ satisfies

    $$\mu > \frac{|f| + d_n + 1}{m_n},$$

then

    $$|f - P_n(x)| > d_n + 1.$$

Thus, we conclude that (8) is valid with the choice
$R = (|f| + d_n + 1)/m_n$ and the proof is ended. ■

Observe that the function $f(x)$ need not be continuous. Furthermore,
note that this theorem gives no estimate of the magnitude of $d_n$. The
semi-norm (3) satisfies condition (5) for $x_0 = 0$, with
$m_n = (n + 1)^{-1/2}$, $M_n = \bigl[\sum_{j=0}^{n} (j!)^2\bigr]^{1/2}$; see
Problem 2.

For the uniqueness result, we require the measure of deviation to be strict. By definition a norm $\|\cdot\|$ is strict if

$$\|f + g\| = \|f\| + \|g\|$$

implies there exist constants $\alpha$, $\beta$ such that $|\alpha| + |\beta| \ne 0$ and

$$\alpha f(x) + \beta g(x) = 0.$$

Now we state

THEOREM 2. To the hypothesis of Theorem 1 add the requirement that $\|\cdot\|$ is strict. Then the minimizing polynomial, say $P_n(x)$, is unique.

Proof. Assume, for a given $f(x)$, that $P_n(x)$ and $Q_n(x)$ both minimize (1). Then from (7), (iii), and (iv), we find

$$d_n \le \bigl\| f - \tfrac{1}{2}(P_n + Q_n) \bigr\| \le \tfrac{1}{2}\|f - P_n\| + \tfrac{1}{2}\|f - Q_n\| = d_n.$$

Hence, equality holds throughout, and since $\|\cdot\|$ is strict there exist non-trivial constants $\alpha$ and $\beta$ such that

$$\alpha [f(x) - P_n(x)] + \beta [f(x) - Q_n(x)] = 0.$$

Now if $\alpha = -\beta \ne 0$ then $P_n(x) = Q_n(x)$. Otherwise, $\alpha \ne -\beta$ and then $f(x)$ must be a polynomial of degree at most $n$. In this case, $d_n = 0$, and using (5) it follows that $P_n(x) = f(x) = Q_n(x)$. ■

Again, observe that the function $f(x)$ need not be continuous, as remarked after the proof of Theorem 1. This theorem is valid for a semi-norm which satisfies the appropriate additional conditions [i.e., (5) and strictness]. Of course, the minimizing polynomial may be unique even though the norm is not strict. Such an instance is furnished by the semi-norm (3) which is not strict. Further examples follow.

PROBLEMS, SECTION 0 

1. Show that if $\|\cdot\|$ is a norm defined for polynomials and satisfies (i)-(iv), then

$$\mathop{\mathrm{g.l.b.}}_{\sum_{j=0}^{n} |b_j|^2 = 1} \Bigl\| \sum_{j=0}^{n} b_j x^j \Bigr\| = m_n > 0.$$

[Hint: Prove that

$$\phi(a_0, \ldots, a_n) = \Bigl\| \sum_{i=0}^{n} a_i x^i \Bigr\|$$

is a continuous function of $(a_0, a_1, \ldots, a_n)$.]

2. Verify that for the semi-norm

$$\|g(x)\|_{(n)} = \sum_{k=0}^{n} |g^{(k)}(0)|,$$

(5) is satisfied with

$$m_n = (n + 1)^{-1/2}, \qquad M_n = \Bigl[ \sum_{j=0}^{n} (j!)^2 \Bigr]^{1/2}.$$

[Hint: Note that some $b_j$ satisfies $|b_j| \ge (n + 1)^{-1/2}$. On the other hand, apply Schwarz' inequality to estimate $\sum_{j=0}^{n} (j!)\,|b_j|$.]

1. WEIERSTRASS' APPROXIMATION THEOREM AND BERNSTEIN POLYNOMIALS

We are justified in seeking a close polynomial approximation to a continuous function throughout a finite interval because of the fundamental

WEIERSTRASS' APPROXIMATION THEOREM. Let $f(x)$ be any function continuous in the (closed) interval $[a, b]$. Then for any $\epsilon > 0$ there exists an integer $n = n(\epsilon)$ and a polynomial $P_n(x)$ of degree at most $n$ such that

$$|f(x) - P_n(x)| < \epsilon,$$

for all $x$ in $[a, b]$.

This theorem guarantees that arbitrarily close polynomial approximations are possible throughout a closed bounded interval provided only that the function being approximated is continuous. The statement of the theorem is one of existence and gives no hint about constructing such approximations. However, a simple and elegant constructive proof of this result is due to Bernstein and we shall present it in Theorem 1.

First, a basic notion in analysis must be recalled and some preliminary identities will be introduced. If $f(x)$ is a continuous function in a closed interval, say $[0, 1]$,† then the modulus of continuity of $f(x)$ in $[0, 1]$ is defined as

$$\omega(f; \delta) = \mathop{\mathrm{l.u.b.}}_{\substack{x,\, x' \text{ in } [0, 1] \\ |x - x'| \le \delta}} |f(x) - f(x')|. \tag{1}$$

Since $f(x)$ is continuous in a closed interval, and hence uniformly continuous, it follows that

$$\lim_{\delta \to 0} \omega(f; \delta) = 0.$$

If, in addition, $f(x)$ satisfies a Lipschitz condition in $[0, 1]$, i.e., if

$$|f(x) - f(x')| \le \lambda |x - x'|,$$

for $x$, $x'$ in $[0, 1]$ and some constant $\lambda$, then from (1):

$$\omega(f; \delta) \le \lambda \delta.$$

The concept of a modulus of continuity is generally useful in analysis and its use will recur in our study.

† We need only consider this case since: An arbitrary finite interval $a \le y \le b$ is mapped 1-1 onto the unit interval $0 \le x \le 1$ by the continuous change of variable: $x = (a - y)/(a - b)$ or $y = (b - a)x + a$. Hence, if $g(y)$ is continuous in $[a, b]$, $f(x) = g((b - a)x + a)$ is continuous in $[0, 1]$. Now, if $P_n(x)$ approximates $f(x)$ in $[0, 1]$ to within $\epsilon$, then $Q_n(y) = P_n((a - y)/(a - b))$ is a polynomial of degree at most $n$ that is within $\epsilon$ of $g(y)$ in $[a, b]$.

The identities required all follow from the well-known binomial expansion

$$(a + b)^n = \sum_{j=0}^{n} \binom{n}{j} a^j b^{n-j}. \tag{2a}$$

Upon forming $\partial (a + b)^n / \partial a$ and $\partial^2 (a + b)^n / \partial a^2$ we obtain the further identities

$$a(a + b)^{n-1} = \sum_{j=0}^{n} \frac{j}{n} \binom{n}{j} a^j b^{n-j}, \tag{2b}$$

$$\Bigl(1 - \frac{1}{n}\Bigr) a^2 (a + b)^{n-2} + \frac{a}{n} (a + b)^{n-1} = \sum_{j=0}^{n} \Bigl(\frac{j}{n}\Bigr)^2 \binom{n}{j} a^j b^{n-j}. \tag{2c}$$

Now set $a = x$ and $b = 1 - x$; define the $n$th degree polynomials

$$\beta_{n,j}(x) = \binom{n}{j} x^j (1 - x)^{n-j}, \qquad j = 0, 1, \ldots, n; \tag{3}$$

and the identities (2) become

$$\sum_{j=0}^{n} \beta_{n,j}(x) = 1, \tag{4a}$$

$$\sum_{j=0}^{n} \frac{j}{n}\, \beta_{n,j}(x) = x, \tag{4b}$$

$$\sum_{j=0}^{n} \Bigl(\frac{j}{n}\Bigr)^2 \beta_{n,j}(x) = \Bigl(1 - \frac{1}{n}\Bigr) x^2 + \frac{1}{n} x. \tag{4c}$$

It should be noted that for $x$ in $[0, 1]$, $\beta_{n,j}(x) \ge 0$.

Let the unit interval $[0, 1]$ be subdivided into $n$ equal subintervals with the endpoints

$$x_j = \frac{j}{n}, \qquad j = 0, 1, \ldots, n. \tag{5}$$

We finally introduce the Bernstein polynomial of degree $n$ for the function $f(x)$ on $[0, 1]$ by the definition:

$$B_n(f; x) = \sum_{j=0}^{n} f(x_j)\, \beta_{n,j}(x). \tag{6}$$


This is, from (3), clearly a polynomial of degree at most $n$ and has coefficients depending upon the values of $f(x)$ at $n + 1$ equally spaced points in $[0, 1]$.

THEOREM 1. Let $f(x)$ be any continuous function defined on $[0, 1]$. Then for all $x$ in $[0, 1]$, and any positive integer $n$,

$$|f(x) - B_n(f; x)| \le \tfrac{9}{4}\, \omega(f; n^{-1/2}). \tag{7}$$

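Proof. From (4a) and (6) we may write

$$f(x) - B_n(f; x) = \sum_{j=0}^{n} [f(x) - f(x_j)]\, \beta_{n,j}(x) = S_1(x) + S_2(x),$$

where we define, for any $\delta > 0$,

$$S_1(x) = \sum_{j:\, |x - x_j| \le \delta} [f(x) - f(x_j)]\, \beta_{n,j}(x), \qquad S_2(x) = \sum_{j:\, |x - x_j| > \delta} [f(x) - f(x_j)]\, \beta_{n,j}(x).$$

Thus by the definition of $\omega(f; \delta)$ and (4a),

$$|S_1(x)| \le \omega(f; \delta) \sum_{j:\, |x - x_j| \le \delta} \beta_{n,j}(x) \le \omega(f; \delta) \sum_{j=0}^{n} \beta_{n,j}(x) = \omega(f; \delta).$$

For the remaining sum, since $|x - x_j| > \delta$, we note

$$f(x) - f(x_j) = [f(x) - f(\xi_1)] + [f(\xi_1) - f(\xi_2)] + \cdots + [f(\xi_{p-1}) - f(\xi_p)] + [f(\xi_p) - f(x_j)],$$

where $p = [\,|x - x_j|/\delta\,]$,† and $\xi_1, \xi_2, \ldots, \xi_p$ are $p$ points inserted uniformly between $(x, x_j)$ where each of the $p + 1$ successive intervals is of length $|x - x_j|/(p + 1) < \delta$. Hence,

$$|f(x) - f(x_j)| \le (p + 1)\, \omega(f; \delta) \le \Bigl(1 + \frac{|x - x_j|}{\delta}\Bigr) \omega(f; \delta).$$

Therefore,

$$|S_2(x)| \le \omega(f; \delta) \Bigl[ \sum_{j:\, |x - x_j| > \delta} \beta_{n,j}(x) + \frac{1}{\delta} \sum_{j:\, |x - x_j| > \delta} |x - x_j|\, \beta_{n,j}(x) \Bigr]$$

$$\le \omega(f; \delta) \Bigl[ 1 + \frac{1}{\delta^2} \sum_{j:\, |x - x_j| > \delta} (x - x_j)^2 \beta_{n,j}(x) \Bigr]$$

$$\le \omega(f; \delta) \Bigl[ 1 + \frac{1}{\delta^2} \sum_{j=0}^{n} (x - x_j)^2 \beta_{n,j}(x) \Bigr].$$

From (5) and (4)

$$\sum_{j=0}^{n} (x - x_j)^2 \beta_{n,j}(x) = \frac{x(1 - x)}{n} \le \frac{1}{4n}.$$

† $p = [x]$, for $x > 0$, means $p$ is the largest integer satisfying $p \le x$.

Therefore,

$$|S_2(x)| \le \omega(f; \delta) \Bigl(1 + \frac{1}{4 n \delta^2}\Bigr),$$

and finally,

$$|f(x) - B_n(f; x)| \le |S_1(x)| + |S_2(x)| \le \omega(f; \delta) \Bigl(2 + \frac{1}{4 n \delta^2}\Bigr).$$

The error estimate (7) is obtained, if we choose $\delta = n^{-1/2}$. ■

Weierstrass' theorem follows by picking $n$ so large that $\omega(f; n^{-1/2}) < 4\epsilon/9$.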
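The construction can be tried numerically. The following is a minimal sketch, not part of the original text: the function names `beta`, `bernstein`, `max_error` and the test function `f_abs` are illustrative choices. It evaluates the Bernstein polynomial of $f(x) = |x - 1/2|$, which is Lipschitz with constant 1 so that $\omega(f; \delta) \le \delta$, and compares the observed maximum error on a sample grid with the bound $\tfrac{9}{4} n^{-1/2}$ implied by the theorem.

```python
from math import comb

def beta(n, j, x):
    """Bernstein basis polynomial C(n, j) x^j (1 - x)^(n - j)."""
    return comb(n, j) * x**j * (1 - x)**(n - j)

def bernstein(f, n, x):
    """B_n(f; x): sum of f(j/n) * beta_{n,j}(x) over j = 0, ..., n."""
    return sum(f(j / n) * beta(n, j, x) for j in range(n + 1))

def max_error(f, n, samples=201):
    """Largest |f(x) - B_n(f; x)| over an equally spaced sample grid in [0, 1]."""
    xs = [i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - bernstein(f, n, x)) for x in xs)

def f_abs(x):
    """Illustrative test function: Lipschitz with constant 1, not differentiable at 1/2."""
    return abs(x - 0.5)

for n in (4, 16, 64):
    # observed sampled error next to the bound (9/4) * n^(-1/2)
    print(n, max_error(f_abs, n), 2.25 * n ** -0.5)
```

The sampled error only estimates the true maximum, but it should stay below the bound for every $n$, and it shrinks roughly like $n^{-1/2}$ for this non-smooth function.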

If $f(x)$ satisfies a Lipschitz condition, we find easily:

COROLLARY. Let $f(x)$ satisfy a Lipschitz condition

$$|f(x) - f(y)| \le \lambda |x - y|,$$

for all $x$, $y$ in $[0, 1]$. Then for all $x$ in $[0, 1]$,

$$|f(x) - B_n(f; x)| \le \tfrac{9}{4}\, \lambda\, n^{-1/2}. \tag{8}$$

It can be shown that the approximation given by $B_n(f; x)$ may be better than is implied in this result (see Problem 1). However, in general, even if $f(x)$ has $p$ derivatives, the convergence is, at best, of order $1/n$. In fact, it can be shown that

$$\lim_{n \to \infty} n [B_n(f; x) - f(x)] = \tfrac{1}{2} f''(x)\, x(1 - x), \qquad \text{if } p \ge 2.$$

As such convergence is quite slow compared to that of many other polynomial approximation methods (see Theorem 9 of Section 3), the Bernstein polynomials are seldom used in practice. It should be emphasized, however, that they converge (uniformly) for any continuous function when many of the other polynomial approximations do not.
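As a quick check of the $1/n$ rate (a worked example added here for illustration, not taken from the original text), let $f(x) = x^2$. With $x_j = j/n$, the second-moment identity for the Bernstein basis, $\sum_{j=0}^{n} (j/n)^2 \beta_{n,j}(x) = (1 - 1/n)x^2 + x/n$, gives the Bernstein polynomial in closed form:

$$B_n(f; x) = \Bigl(1 - \frac{1}{n}\Bigr) x^2 + \frac{x}{n}, \qquad n\,[B_n(f; x) - f(x)] = x(1 - x) = \tfrac{1}{2} f''(x)\, x(1 - x).$$

So for this $f$ the error is exactly $x(1 - x)/n$ for every $n$, and no faster decay is possible even though $f$ is itself a polynomial of degree 2.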

The Weierstrass approximation theorem is valid for functions of several 
variables which are continuous on appropriate sets. In fact, the Bernstein 
polynomials can again be employed to yield a constructive proof in various 
cases (see Problem 2). 


PROBLEMS, SECTION 1 

1. If $f(x)$ satisfies a Lipschitz condition with constant $\lambda$ in $[0, 1]$, show

$$E_n \equiv |f(x) - B_n(f; x)| \le (\lambda/2)\, n^{-1/2}.$$

[Hint: Use Schwarz' inequality to get

$$E_n = \Bigl| \sum_{j=0}^{n} [f(x) - f(x_j)]\, \beta_{n,j}(x) \Bigr| \le \Bigl\{ \sum_{j=0}^{n} [f(x) - f(x_j)]^2 \beta_{n,j}(x) \Bigr\}^{1/2};$$

$$\sum_{j=0}^{n} [f(x) - f(x_j)]^2 \beta_{n,j}(x) \le \lambda^2 \sum_{j=0}^{n} (x - x_j)^2 \beta_{n,j}(x) \le \frac{\lambda^2}{4n}.]$$

2. Let $f(x, y)$ be continuous on the closed unit square: $0 \le x \le 1$, $0 \le y \le 1$. Then prove Weierstrass' theorem for this case by employing the polynomials:

$$B_{m,n}(f; x, y) = \sum_{j=0}^{m} \sum_{k=0}^{n} f\Bigl(\frac{j}{m}, \frac{k}{n}\Bigr) \beta_{m,j}(x)\, \beta_{n,k}(y).$$

Show how to extend this theorem to functions of more variables continuous in arbitrary "cubes" with faces parallel to the coordinate planes.

[Hint: Let $R(x, y) = f(x, y) - B_{m,n}(f; x, y)$. Show that

$$R(x, y) = \sum_{j=0}^{m} \sum_{k=0}^{n} \Bigl[ f(x, y) - f\Bigl(\frac{j}{m}, \frac{k}{n}\Bigr) \Bigr] \beta_{m,j}(x)\, \beta_{n,k}(y),$$

$$|R| \le \Bigl| \sum_{\substack{|x - x_j| \le \delta \\ |y - y_k| \le \delta}} \cdots \Bigr| + \Bigl| \sum_{j:\, |x - x_j| > \delta} \cdots \Bigr| + \Bigl| \sum_{k:\, |y - y_k| > \delta} \cdots \Bigr|,$$

and use the reasoning of the one variable case.

To prove the theorem when $f(x, y)$ is continuous in a square, $0 \le x - a \le c$, $0 \le y - b \le c$: define $g(u, v) = f(cu + a, cv + b)$ for $0 \le u, v \le 1$. Then construct $B_{m,n}(g; u, v)$. Finally, set

$$P_{m,n}(x, y) \equiv B_{m,n}\Bigl(g; \frac{x - a}{c}, \frac{y - b}{c}\Bigr)$$

and observe that $|f(x, y) - P_{m,n}(x, y)|$ can be made small.]

3.* If $f(x)$ has a continuous first derivative in $[0, 1]$, show that the first derivatives of the Bernstein polynomials which approximate $f(x)$ converge to $f'(x)$ uniformly on $[0, 1]$.

[Hint: Verify that

$$\beta'_{n,j} = n(\beta_{n-1,j-1} - \beta_{n-1,j}) \quad \text{for } j = 1, 2, \ldots, n - 1,$$

$$\beta'_{n,n} = n \beta_{n-1,n-1}, \qquad \beta'_{n,0} = -n \beta_{n-1,0}.$$

Then regroup the sum in terms of $\beta_{n-1,k}$ for $k = 0, 1, \ldots, n - 1$.]


2. THE INTERPOLATION POLYNOMIALS 

An approximation polynomial which is equal to the function it approximates at a number of specified points is called an interpolation polynomial. Given the $n + 1$ distinct points $x_i$, $i = 0, 1, \ldots, n$ and corresponding function values $f(x_i)$, the interpolation polynomial of degree at most $n$ minimizes the norm:†

$$\|f(x) - P_n(x)\| \equiv \sum_{i=0}^{n} |f(x_i) - P_n(x_i)|. \tag{1}$$

We shall show that such a polynomial exists (by explicit construction) and is unique; in fact, that the minimum value of the norm in (1) is 0.

The least possible value of $\|f - P_n\|$ is, of course, zero. Thus, we seek a polynomial

$$Q_n(x) = \sum_{k=0}^{n} a_k x^k \tag{2}$$

for which $Q_n(x_i) = f(x_i)$. By considering the coefficients $a_k$ in (2) as unknowns, we have the system of $n + 1$ linear equations

$$Q_n(x_i) = a_0 + a_1 x_i + \cdots + a_n x_i^n = f(x_i), \qquad i = 0, 1, \ldots, n. \tag{3}$$

This system has a unique solution if the coefficient matrix is non-singular. The determinant of this matrix is called a Vandermonde determinant and can be easily evaluated (see Problem 1) to yield

$$\begin{vmatrix} 1 & x_0 & \cdots & x_0^n \\ 1 & x_1 & \cdots & x_1^n \\ \vdots & \vdots & & \vdots \\ 1 & x_n & \cdots & x_n^n \end{vmatrix} = \prod_{i > j} (x_i - x_j) = \prod_{j=0}^{n-1} \Bigl[ \prod_{i=j+1}^{n} (x_i - x_j) \Bigr]. \tag{4}$$

Since the $\{x_i\}$ are distinct points, the determinant does not vanish and (3) may be uniquely solved for the $a_i$ to determine the interpolation polynomial. Another proof of uniqueness is given in Lemma 1.

Rather than solve the system (3), we may use an alternative procedure to obtain the interpolation polynomial directly. Set

$$P_n(x) = \sum_{j=0}^{n} f(x_j)\, \phi_{n,j}(x), \tag{5a}$$

where the $n + 1$ functions $\phi_{n,j}(x)$ are $n$th degree polynomials. We note that $P_n(x_i) = f(x_i)$ if the polynomials $\phi_{n,j}(x)$ satisfy:

$$\phi_{n,j}(x_i) = \delta_{ij}, \qquad i, j = 0, 1, \ldots, n.$$

† This is only a semi-norm. Of course, Theorem 0.1 shows that there is a $P_n(x)$ which minimizes the norm in (1), but we prove more here, namely that $d_n = 0$. We also prove uniqueness which is not covered by Theorem 0.2, since (1) is not strict.

Such polynomials are easily constructed, since the $\{x_i\}$ are distinct, i.e.,

$$\phi_{n,j}(x) = \frac{(x - x_0)(x - x_1) \cdots (x - x_{j-1})(x - x_{j+1}) \cdots (x - x_n)}{(x_j - x_0)(x_j - x_1) \cdots (x_j - x_{j-1})(x_j - x_{j+1}) \cdots (x_j - x_n)}, \qquad j = 0, 1, \ldots, n. \tag{6a}$$

By introducing $\omega_n(x) \equiv (x - x_0)(x - x_1) \cdots (x - x_n)$ we find that (6a) can be written in the brief form [where $\omega_n'(x_j) = (d\omega_n(x)/dx)_{x = x_j}$]:

$$\phi_{n,j}(x) = \frac{\omega_n(x)}{(x - x_j)\, \omega_n'(x_j)}. \tag{6b}$$

The interpolation polynomial, especially when in the form (5), is called the Lagrange interpolation polynomial and the polynomials (6) are called the Lagrange interpolation coefficients. We can use the product notation for $\phi$ which yields

$$P_n(x) = \sum_{j=0}^{n} f(x_j) \prod_{\substack{k=0 \\ k \ne j}}^{n} \frac{x - x_k}{x_j - x_k}. \tag{5b}$$
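The product form translates directly into a short routine. The sketch below is an illustration under stated assumptions rather than anything from the original text: the names `lagrange_coefficient` and `lagrange_interpolate` are hypothetical, and the nodes and values are assumed to be integers or `Fraction` objects so that all arithmetic is exact.

```python
from fractions import Fraction

def lagrange_coefficient(xs, j, x):
    """phi_{n,j}(x): product over k != j of (x - x_k) / (x_j - x_k)."""
    value = Fraction(1)
    for k, xk in enumerate(xs):
        if k != j:
            value *= Fraction(x - xk, xs[j] - xk)
    return value

def lagrange_interpolate(xs, fs, x):
    """P_n(x): sum over j of f(x_j) * phi_{n,j}(x), for distinct nodes xs."""
    return sum(fj * lagrange_coefficient(xs, j, x) for j, fj in enumerate(fs))

# Illustrative data: f(x) = x^3 - 2x sampled at four distinct nodes.
nodes = [0, 1, 2, 4]
values = [x**3 - 2 * x for x in nodes]
print(lagrange_interpolate(nodes, values, 3))  # f is a cubic, so this equals f(3) = 21
```

Because a cubic is determined by four points, the interpolant reproduces $f$ exactly here; with floating-point nodes the same formulas apply but the results are only approximate.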

That the Lagrange interpolation polynomial is identical to the polynomial defined by (2) and (3) is a consequence of the following

LEMMA 1. Let $P_n(x)$ and $Q_n(x)$ be any two polynomials, of degree at most $n$, for which

$$P_n(x_i) = Q_n(x_i), \qquad i = 0, 1, 2, \ldots, n,$$

where the $n + 1$ points $\{x_i\}$ are distinct. Then $P_n(x) \equiv Q_n(x)$.

Proof. Define the polynomial

$$D_n(x) \equiv P_n(x) - Q_n(x),$$

which is of degree at most $n$. This polynomial has at least $n + 1$ distinct roots:

$$D_n(x_i) = 0, \qquad i = 0, 1, \ldots, n.$$

However, the only polynomial of degree at most $n$ with more than $n$ roots is the identically vanishing "polynomial" $D_n(x) \equiv 0$. ■

In summary, there is only one polynomial of degree at most $n$ for which (1) vanishes and it is given by (5) and (6). Of course, there are many other ways of representing this polynomial; since such considerations are of great practical interest they form a large part of the next chapter.


2.1. The Pointwise Error in Interpolation Polynomials 

The pointwise error between a function, $f(x)$, and some polynomial approximation to it, $P_n(x)$, is defined as

$$R_n(x) \equiv f(x) - P_n(x). \tag{7}$$

It is, of course, quite useful to have an explicit expression for this error and, if possible, simple bounds on it. Such information may yield, as in the example of Section 0, an estimate of the rapidity of convergence of $P_n(x)$ to $f(x)$ as $n \to \infty$ (or of divergence). Further, it facilitates a comparison of the different types of polynomial approximation.

For interpolation polynomials, a useful representation of $R_n(x)$ is readily obtained. This result may be stated as

THEOREM 1. Let $f(x)$ have an $(n + 1)$st derivative, $f^{(n+1)}(x)$, in an interval $[a, b]$. Let $P_n(x)$ be the interpolation polynomial for $f(x)$ with respect to $n + 1$ distinct points $x_i$, $i = 0, 1, \ldots, n$ in the interval $[a, b]$ (i.e., $P_n(x_i) = f(x_i)$ and $x_i \in [a, b]$). Then for each $x \in [a, b]$ there exists a point $\xi = \xi(x)$ in the open interval:

$$\min(x_0, x_1, \ldots, x_n, x) < \xi < \max(x_0, x_1, \ldots, x_n, x), \tag{8}$$

such that

$$f(x) - P_n(x) \equiv R_n(x) = \frac{(x - x_0)(x - x_1) \cdots (x - x_n)}{(n + 1)!} f^{(n+1)}(\xi) = \frac{\omega_n(x)}{(n + 1)!} f^{(n+1)}(\xi). \tag{9}$$

Proof. Since

$$R_n(x_0) = R_n(x_1) = \cdots = R_n(x_n) = 0,$$

we define $S_n(x)$, for any $x \ne x_i$, by setting

$$R_n(x) \equiv (x - x_0)(x - x_1) \cdots (x - x_n)\, S_n(x) = \omega_n(x)\, S_n(x). \tag{10}$$

Considering $x$ to be fixed as above we also define a function $F(z)$ by

$$F(z) \equiv f(z) - P_n(z) - \omega_n(z)\, S_n(x).$$

Clearly, this function and its derivatives with respect to $z$ are defined and continuous wherever $f(z)$ and its derivatives are defined and continuous; thus, $F^{(n+1)}(z)$ is defined in $[a, b]$. (See Problem 5 for a mild generalization.)

We see that $F(z)$ vanishes at $n + 2$ distinct points in $[a, b]$, namely

$$F(x_0) = F(x_1) = \cdots = F(x_n) = F(x) = 0.$$

Thus, there are $n + 1$ adjacent intervals in $[a, b]$ at whose endpoints $F(z)$ vanishes. Rolle's theorem is now applicable, since $F'(z)$ is defined in $[a, b]$. Therefore, in the interior of each of these intervals, there is at least one point at which $F'(z)$ vanishes. Thus, there are at least $n + 1$ distinct points in the interval (8) at which $F'(z) = 0$. They form at least $n$ intervals such that in the interior of each, by another application of Rolle's theorem, the derivative of $F'(z)$ vanishes. That is, $F''(z) = 0$ for at least $n$ distinct points in (8). By continuing this process we find that there is some point, say $\xi$, in (8) at which the $(n + 1)$st derivative of $F(z)$ vanishes.

However, since $P_n(z)$ is an $n$th degree polynomial,

$$\frac{d^{n+1} P_n(z)}{dz^{n+1}} = 0$$

and a simple calculation yields

$$\frac{d^{n+1}}{dz^{n+1}} [\omega_n(z)\, S_n(x)] = (n + 1)!\, S_n(x).$$

Thus,

$$F^{(n+1)}(z) = f^{(n+1)}(z) - (n + 1)!\, S_n(x),$$

and since $F^{(n+1)}(\xi) = 0$ we obtain

$$S_n(x) = \frac{1}{(n + 1)!} f^{(n+1)}(\xi), \qquad \text{for } x \ne x_0, x_1, \ldots, x_n. \tag{11}$$

With this in (10) the theorem follows. It should be pointed out that, although $S_n(x)$ is not defined for $x = x_i$, the final result (9) is valid for all $x$ in $[a, b]$ [in fact, since $\omega_n(x_i) = 0$, $\xi$ for these values of $x$ may be picked arbitrarily]. ■

If the maximum and minimum of $f^{(n+1)}(x)$ in $[a, b]$ can be determined, (9) will yield bounds on the error. It should be noted that the error (9) for interpolation polynomials is similar to the remainder in Taylor's expansion (0.4). In fact, we might naively assume that if $|x - x_i| \le |x - x_0|$ for $i = 1, 2, \ldots, n$ then the interpolation polynomial error is smaller than the error in Taylor's expansion about the point $x_0$. This assumption is not always justified since the terms $f^{(n+1)}(\xi)$ in (0.4) and (9) are not evaluated at the same point $\xi$ for a given $x$.

Does the sequence of interpolation polynomials $\{P_n(x)\}$ converge to $f(x)$ in $[a, b]$ if $\{(x_0^{(n)}, \ldots, x_n^{(n)})\}$ covers $[a, b]$? This is a question that is not completely answered. In the case of uniform spacing [i.e., $x_0^{(n)} = a$, $x_j^{(n)} = x_0 + j h_n$, $h_n = (b - a)/n$], we illustrate the fact that divergence is to be expected by studying Runge's example, $f(x) = 1/(1 + x^2)$ over $[-5, 5]$ (see Chapter 6, Subsection 3.4). On the other hand, in Corollary 2, Theorem 2 of Section 5, we exhibit a sequence of non-uniformly spaced points, $\{(x_0^{(n)}, \ldots, x_n^{(n)})\}$, for which uniform convergence of $\{P_n(x)\}$ to $f(x)$ may be established for any function $f(x)$ with continuous second derivatives.†
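The divergence for uniform spacing can be observed directly. The following sketch is an illustration only, not the treatment of Chapter 6: the helper names `lagrange_eval`, `runge` and `max_error_uniform` are hypothetical, and the maximum is estimated on a finite sample grid. It interpolates Runge's function at $n + 1$ equally spaced points of $[-5, 5]$ and reports the largest sampled error.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolation polynomial through (xs[j], ys[j]) at x."""
    total = 0.0
    for j, xj in enumerate(xs):
        term = ys[j]
        for k, xk in enumerate(xs):
            if k != j:
                term *= (x - xk) / (xj - xk)
        total += term
    return total

def runge(x):
    """Runge's example f(x) = 1 / (1 + x^2)."""
    return 1.0 / (1.0 + x * x)

def max_error_uniform(n, a=-5.0, b=5.0, samples=1001):
    """Largest sampled |f - P_n| for equally spaced nodes x_j = a + j (b - a) / n."""
    h = (b - a) / n
    xs = [a + j * h for j in range(n + 1)]
    ys = [runge(x) for x in xs]
    grid = [a + (b - a) * i / (samples - 1) for i in range(samples)]
    return max(abs(runge(t) - lagrange_eval(xs, ys, t)) for t in grid)

for n in (4, 10, 20):
    print(n, max_error_uniform(n))
```

The sampled error grows with $n$ instead of shrinking; the large values occur near the ends of the interval, while the fit near the center keeps improving.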

† Amazingly, for any sequence $\{(x_0^{(n)}, \ldots, x_n^{(n)})\}$ there exists a continuous function $f(x)$ for which $\|P_n(x) - f(x)\|_\infty \nrightarrow 0$!

From the remainder theorem we can deduce some interesting and useful identities satisfied by the Lagrange interpolation coefficients. Since $d^{n+1} x^m / dx^{n+1} = 0$ for $m = 0, 1, \ldots, n$, the interpolation polynomials of degree $n$ represent exactly all polynomials of degree at most $n$. Thus, with $f(x) = x^m$ in (5), equation (9) yields

$$\sum_{j=0}^{n} x_j^m\, \phi_{n,j}(x) = x^m, \qquad m = 0, 1, \ldots, n. \tag{12}$$

The case $m = 0$ is particularly useful. [Compare (12) with equation (1.4) for the Bernstein polynomials.]

2.2. Hermite or Osculating Interpolation 

The osculating polynomial, a generalization of the interpolation polynomial, is obtained by requiring agreement with $f(x)$ and its first $r_j - 1$ derivatives at the distinct points of interpolation, $x_j$. (This polynomial arises also as the limit of the interpolation polynomials when $r_j$ points of ordinary interpolation approach each other at the point $x_j$.) This procedure contains, as special cases, Taylor's expansion and ordinary interpolation. The number of combinations is boundless, but, in fact, a representation of the osculating polynomial can be found together with an expression for the pointwise error (see Chapter 6, Section 1, Problem 10). We shall consider in detail the case in which the function and its first derivative are to be assigned at each point of interpolation. This special procedure is usually called Hermite or osculatory interpolation.

The problem is to find a polynomial of least degree, say $H_{2n+1}(x)$, such that:

$$f(x_j) = H_{2n+1}(x_j), \qquad f'(x_j) = H'_{2n+1}(x_j), \qquad j = 0, 1, \ldots, n. \tag{13}$$

By counting the data (i.e., $2n + 2$ conditions), we find that a polynomial of degree $2n + 1$ has the required number of undetermined coefficients. Thus, in analogy with the Lagrange interpolation formula (5), we seek a representation in the form

$$H_{2n+1}(x) = \sum_{j=0}^{n} f(x_j)\, \psi_{n,j}(x) + \sum_{j=0}^{n} f'(x_j)\, \Psi_{n,j}(x). \tag{14}$$

Here the polynomials $\psi_{n,j}(x)$ and $\Psi_{n,j}(x)$ are required to be of degree at most $2n + 1$ and to satisfy

$$\psi_{n,j}(x_i) = \delta_{ij}, \quad \Psi_{n,j}(x_i) = 0; \qquad \psi'_{n,j}(x_i) = 0, \quad \Psi'_{n,j}(x_i) = \delta_{ij}, \qquad i, j = 0, 1, \ldots, n. \tag{15}$$

Such polynomials are given in terms of the Lagrange interpolation coefficients, $\phi_{n,j}(x)$, as:

$$\psi_{n,j}(x) = [1 - 2 \phi'_{n,j}(x_j)(x - x_j)]\, \phi_{n,j}^2(x), \qquad \Psi_{n,j}(x) = (x - x_j)\, \phi_{n,j}^2(x). \tag{16}$$

The error in using (14) to approximate $f(x)$ is

$$f(x) - H_{2n+1}(x) = \frac{\omega_n^2(x)}{(2n + 2)!} f^{(2n+2)}(\xi), \tag{17}$$

provided $f(x)$ has a continuous derivative of order $2n + 2$. The point $\xi = \xi(x)$ is again in the interval determined by the points $x, x_0, \ldots, x_n$. The proof of formula (17) is left to Problem 3.

From equation (17) we easily deduce, in analogy with (12), the identities:

$$\sum_{j=0}^{n} x_j^m\, \psi_{n,j}(x) + m \sum_{j=0}^{n} x_j^{m-1}\, \Psi_{n,j}(x) = x^m, \qquad m = 0, 1, \ldots, 2n + 1. \tag{18}$$
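The Hermite representation can be evaluated with only the Lagrange coefficients and the values $\phi'_{n,j}(x_j) = \sum_{k \ne j} 1/(x_j - x_k)$, which follows by logarithmic differentiation of the product form. The sketch below is illustrative and rests on assumptions not in the original text: the name `hermite_eval` is hypothetical, and nodes, data and the evaluation point are taken to be integers or `Fraction` objects so that the check is exact.

```python
from fractions import Fraction

def hermite_eval(xs, fs, dfs, x):
    """Evaluate H_{2n+1}(x) from values fs and derivatives dfs at distinct nodes xs."""
    total = Fraction(0)
    for j, xj in enumerate(xs):
        phi = Fraction(1)          # Lagrange coefficient phi_{n,j}(x)
        dphi_at_xj = Fraction(0)   # phi'_{n,j}(x_j) = sum over k != j of 1 / (x_j - x_k)
        for k, xk in enumerate(xs):
            if k != j:
                phi *= Fraction(x - xk) / (xj - xk)
                dphi_at_xj += Fraction(1) / (xj - xk)
        psi = (1 - 2 * dphi_at_xj * (x - xj)) * phi**2   # multiplies f(x_j)
        big_psi = (x - xj) * phi**2                       # multiplies f'(x_j)
        total += fs[j] * psi + dfs[j] * big_psi
    return total

# Illustrative data: f(x) = x^3 - 2x + 1, f'(x) = 3x^2 - 2, two nodes (n = 1).
nodes = [0, 1]
f_vals = [1, 0]
df_vals = [-2, 1]
print(hermite_eval(nodes, f_vals, df_vals, Fraction(1, 2)))  # prints 1/8, which is f(1/2)
```

With two nodes the Hermite polynomial has degree at most 3, so a cubic is reproduced exactly, in agreement with the identities for $m \le 2n + 1$.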


PROBLEMS, SECTION 2 


1. Evaluate the Vandermonde determinant to verify (4).

[Hint: Let each $x_j$, $j = 0, 1, \ldots, n$, in order, be considered variable and determine all the roots of the resulting polynomial. The remaining scalar factor is obtained by evaluating the coefficient of any specific term, say $(1 \cdot x_1 \cdot x_2^2 \cdots x_n^n)$. An alternative proof could be given by using mathematical induction and expanding the determinant with respect to the elements of the last column.]

2. Prove that the system (3) is non-singular by assuming that the homogeneous system has a non-trivial solution and using Lemma 1 to obtain a contradiction.

3. Derive the error formula, equation (17), for Hermite interpolation if $f(x)$ is sufficiently differentiable.

[Hint: Proceed exactly as in the derivation of the interpolation error and define: $F(z) \equiv f(z) - H_{2n+1}(z) - \omega_n^2(z)\, S_n(x)$. After the first application of Rolle's theorem, however, $F'(z)$ will have $2n + 2$ distinct zeros.]

4. Formulate the definition of the Hermite interpolation polynomial as a minimizing polynomial for the appropriate semi-norm. Does this semi-norm satisfy hypothesis (0.5) of Theorem 0.1? Is it strict?

5. Show that the conclusion of Theorem 1 follows under the weaker assumption: $f(x)$ is continuous in the closed interval $[a, b]$, but has the requisite derivatives only in the open interval $(a, b)$.


3. LEAST SQUARES APPROXIMATION

A property which is frequently used to determine an approximating polynomial,

$$Q_n(x) = a_0 + a_1 x + \cdots + a_n x^n, \tag{1}$$

is that the $L_2$ norm, or mean square error,

$$\|f(x) - Q_n(x)\|_2 \equiv \Bigl\{ \int_a^b [f(x) - Q_n(x)]^2\, dx \Bigr\}^{1/2} \tag{2}$$

be a minimum. For the general polynomial (1) and any appropriate† function $f(x)$, we define the function of $n + 1$ variables,

$$\phi(a_0, a_1, \ldots, a_n) \equiv \int_a^b [f(x) - Q_n(x)]^2\, dx. \tag{3}$$

The least squares polynomial approximation to $f(x)$ of degree at most $n$ is then determined by finding a point $(\hat{a}_0, \hat{a}_1, \ldots, \hat{a}_n)$ in the $n + 1$ dimensional space for which $\phi$ is a minimum.

THEOREM 1. For each appropriate function $f(x)$, there is a unique least squares polynomial approximation of degree at most $n$ which minimizes (2).

Proof. The hypotheses of Theorems 0.1 and 0.2 are satisfied by $\|\cdot\|_2$ (see Problem 10), whence existence and uniqueness are established. ■

We now give an analytical description of a method for calculating the coefficients $\hat{a} \equiv (\hat{a}_0, \hat{a}_1, \ldots, \hat{a}_n)$ of the polynomial that minimizes $\phi(a)$ in (3).

Since

$$\phi(a_0, \ldots, a_n) = \int_a^b f^2(x)\, dx - 2 \sum_{i=0}^{n} a_i \int_a^b x^i f(x)\, dx + \sum_{i=0}^{n} \sum_{j=0}^{n} a_i a_j \int_a^b x^{i+j}\, dx, \tag{4}$$

$\phi$ is a quadratic function in the variables $a_i$. Now, at the minimum of $\phi$ the coefficients $\hat{a}$ must satisfy

$$\frac{\partial \phi(a_0, a_1, \ldots, a_n)}{\partial a_k} \Bigr|_{a = \hat{a}} = 0, \qquad k = 0, 1, \ldots, n.$$

Explicitly,

$$0 = \frac{\partial \phi}{\partial a_k} \Bigr|_{a = \hat{a}} = 0 - 2 \int_a^b x^k f(x)\, dx + \sum_{i=0}^{n} \hat{a}_i \int_a^b x^{i+k}\, dx + \sum_{j=0}^{n} \hat{a}_j \int_a^b x^{k+j}\, dx = 2 \Bigl[ \sum_{i=0}^{n} \hat{a}_i \int_a^b x^{i+k}\, dx - \int_a^b x^k f(x)\, dx \Bigr], \qquad k = 0, 1, \ldots, n. \tag{5a}$$

A system of $n + 1$ linear equations for the determination of the $\{\hat{a}_i\}$ is defined in (5a); it is frequently called the normal system.

† The given function, $f(x)$, for this purpose need only be restricted so that it and its square are integrable over $[a, b]$. The complete linear space for which (2) is a norm consists of all such functions, if we identify two functions which differ only on a set of measure zero in $[a, b]$, and use the Lebesgue integral. But we do not pursue this avenue of generalization and shall, unless otherwise noted, consider only functions that are continuous, except at a finite number of points, where certain conditions will be specified.

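We write the normal system (5a) in the form:

$$\sum_{j=0}^{n} h_{ij}\, \hat{a}_j = c_i, \qquad i = 0, 1, \ldots, n, \tag{5b}$$

where the coefficient matrix and right-hand side are given by

$$H_{n+1}(a, b) \equiv (h_{ij}), \qquad h_{ij} \equiv \int_a^b x^{i+j}\, dx; \qquad c_i \equiv \int_a^b x^i f(x)\, dx. \tag{5c}$$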
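As a small worked instance of the normal system (an example chosen here for illustration, not one from the original text), take $[a, b] = [0, 1]$, $n = 1$ and $f(x) = x^2$. Then $h_{ij} = 1/(i + j + 1)$ and $c_i = 1/(i + 3)$ for $i, j = 0, 1$, so the system reads

$$\hat{a}_0 + \tfrac{1}{2} \hat{a}_1 = \tfrac{1}{3}, \qquad \tfrac{1}{2} \hat{a}_0 + \tfrac{1}{3} \hat{a}_1 = \tfrac{1}{4},$$

with solution $\hat{a}_0 = -\tfrac{1}{6}$, $\hat{a}_1 = 1$. The least squares straight line for $x^2$ on $[0, 1]$ is therefore $Q_1(x) = x - \tfrac{1}{6}$.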

Now we have

THEOREM 2. The coefficient matrix $H_{n+1}(a, b)$ is non-singular.

Proof. For a given arbitrary vector $(c_0, c_1, \ldots, c_n)$ it is possible to find a polynomial $f(x)$, such that

$$\int_a^b x^k f(x)\, dx = c_k, \qquad k = 0, 1, \ldots, n.$$

In fact, the polynomial can be of degree at most $n$ and we leave this construction to Problem 11. If

$$f(x) \equiv \sum_{i=0}^{n} a_i x^i,$$

then (5a) has the solution $\hat{a}_i = a_i$, $i = 0, 1, \ldots, n$. Therefore, the system (5b) has at least one solution for any right-hand side and this implies that the system is non-singular. ■

In the special case $[a, b] = [0, 1]$, we get from (5c):

$$H_{n+1}(0, 1) = \begin{pmatrix} 1 & \frac{1}{2} & \cdots & \frac{1}{n+1} \\ \frac{1}{2} & \frac{1}{3} & \cdots & \frac{1}{n+2} \\ \vdots & \vdots & & \vdots \\ \frac{1}{n+1} & \frac{1}{n+2} & \cdots & \frac{1}{2n+1} \end{pmatrix} \equiv (h_{ij}), \tag{6}$$

where $h_{ij} = 1/(i + j - 1)$ and $i, j = 1, 2, \ldots, n + 1$, a so-called Hilbert segment matrix. It is difficult to solve numerically a system of equations with the matrix $H_n(0, 1)$. (E.g., solve $H_4 x = (0, 0, 0, 1)^T$: (a) exactly; (b) using four-decimal-place arithmetic.) However, it is possible to find the explicit inverse of $H_n(0, 1)$ with the aid of some properties of the Lagrange interpolation coefficients (see Problems 1, 2). As a consequence of the non-singularity of $H_n(a, b)$, and the fact that (5) is a necessary condition for a minimum of $\phi(a_0, a_1, \ldots, a_n)$, we have another proof that the least squares polynomial approximation of degree $n$ is unique.
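The sensitivity of the Hilbert system can be seen with exact rational arithmetic. The sketch below is a related but simpler experiment than part (b) of the exercise, and it is an assumption of this illustration rather than the authors' procedure: instead of rounding every arithmetic step, it rounds only the matrix entries to four decimal places and then solves both systems exactly. The function names `hilbert`, `round_entries` and `solve_exact` are hypothetical, and `solve_exact` assumes a non-singular matrix.

```python
from fractions import Fraction

def hilbert(n):
    """Hilbert segment matrix with h_ij = 1 / (i + j - 1), i, j = 1, ..., n."""
    return [[Fraction(1, i + j - 1) for j in range(1, n + 1)]
            for i in range(1, n + 1)]

def round_entries(A, places=4):
    """Round every entry to the given number of decimals, kept as exact fractions."""
    scale = 10 ** places
    return [[Fraction(round(v * scale), scale) for v in row] for row in A]

def solve_exact(A, b):
    """Gaussian elimination in exact rational arithmetic (A assumed non-singular)."""
    n = len(b)
    M = [list(row) + [Fraction(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)  # first usable pivot row
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            m = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= m * M[c][k]
    x = [Fraction(0)] * n
    for r in reversed(range(n)):
        s = M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / M[r][r]
    return x

rhs = [0, 0, 0, 1]
exact = solve_exact(hilbert(4), rhs)
perturbed = solve_exact(round_entries(hilbert(4)), rhs)
print([int(v) for v in exact])        # [-140, 1680, -4200, 2800]
print([float(v) for v in perturbed])  # solution of the system with rounded entries
```

Changes of at most $5 \times 10^{-5}$ in the entries alter the components of the solution by hundreds, which is the practical meaning of the difficulty mentioned in the text.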

We may, by the linear change of variable $x = a + (b - a)y$, transform the problem of fitting $f(x)$ by $Q_n(x)$ in $[a, b]$, to that of fitting $f(a + (b - a)y) \equiv g(y)$ by $P_n(y)$ in $[0, 1]$. Afterwards, by setting

$$Q_n(x) \equiv P_n\Bigl(\frac{x - a}{b - a}\Bigr),$$

we have the least squares polynomial that minimizes the norm in (2).

The least squares polynomial can be determined in a way which avoids the difficulties inherent in directly solving the system (5). (This alternative procedure is but a special case of the general theory of approximation by orthogonal functions.)

Consider a set of $n + 1$ polynomials $\{P_k(x)\}$, $k = 0, 1, \ldots, n$, where $P_k(x)$ is of degree† $k$ in $x$. Then without loss of generality, we may let $Q_n(x)$ be a linear combination of these polynomials, say

$$Q_n(x) = \sum_{j=0}^{n} c_j P_j(x). \tag{7}$$

Now the mean square error (2) defines a function

$$J(c_0, c_1, \ldots, c_n) \equiv \int_a^b [f(x) - Q_n(x)]^2\, dx, \tag{8}$$

of the $n + 1$ variables $\{c_k\}$. As before, this function is quadratic in the $c_k$ and at a minimum we must have

$$0 = \frac{\partial J}{\partial c_k} = 0 - 2 \int_a^b P_k(x) f(x)\, dx + 2 \sum_{j=0}^{n} c_j \int_a^b P_j(x) P_k(x)\, dx;$$

or the normal system

$$\sum_{j=0}^{n} c_j \int_a^b P_j(x) P_k(x)\, dx = \int_a^b P_k(x) f(x)\, dx, \qquad k = 0, 1, \ldots, n. \tag{9}$$

† Here we require that $P_k(x)$ have exactly degree $k$, in order that the set $\{P_k(x)\}$ for $k = 0, 1, 2, \ldots$ be linearly independent.

There seems to be no apparent gain in replacing the system (5) by (9). However, the main point of the expansion in polynomials, rather than in powers of $x$, is to choose appropriate $P_k(x)$ so that the system (9) is easily solved. In fact, the simplest choice would be one which makes the coefficient matrix diagonal, or, better still, the identity matrix. This requires

$$\int_a^b P_j(x) P_k(x)\, dx = \delta_{jk}. \tag{10}$$

In the next subsection, we define such a sequence $\{P_k(x)\}$. A set of polynomials (or any functions) which satisfies (10) is called orthonormal over $[a, b]$. The coefficients $c_j$ are now simply given by

$$c_j = \int_a^b P_j(x) f(x)\, dx, \qquad j = 0, 1, \ldots, n. \tag{11}$$

An additional advantage of the expansion in orthonormal polynomials is that the accuracy of the approximation (7) can be improved by adding an additional term, $c_{n+1} P_{n+1}(x)$, without having to recompute the previously determined coefficients, $c_0, c_1, \ldots, c_n$. [It is also clear that (7), with the coefficients (11), represents an approximation of least mean square error for any set of orthonormal functions, not necessarily polynomials, which satisfy (10).] For the approximation determined by (7) and (11), it easily follows from (10) in (8) that

$$J(c_0, c_1, \ldots, c_n) = \int_a^b f^2(x)\, dx - \sum_{j=0}^{n} c_j^2 \ge 0. \tag{12a}$$

If we let $n \to \infty$ it follows from (12a) that $\sum_{j=0}^{\infty} c_j^2$ converges. Hence, we deduce that $\lim_{j \to \infty} c_j = 0$. This is a conclusion about the integrals of form (11) for general orthonormal functions, $P_j(x)$. The inequality (12a) is known as Bessel's inequality.

Convergence in the mean of the least squares polynomial approximation to a continuous function is easily demonstrated. Specifically we have

THEOREM 3. Let $f(x)$ be continuous on $[a, b]$ and $Q_n(x)$, $n = 0, 1, 2, \ldots$, be the least squares polynomial approximations to $f(x)$ on $[a, b]$ determined by (7) and (11). Then

$$\lim_{n \to \infty} J_n \equiv \lim_{n \to \infty} \int_a^b [f(x) - Q_n(x)]^2\, dx = 0,$$

and we have Parseval's equality

$$\int_a^b f^2(x)\, dx = \sum_{j=0}^{\infty} c_j^2. \tag{12b}$$

Proof. For a proof by contradiction assume that $\lim_{n \to \infty} J_n = \delta > 0$. Then we pick $\epsilon > 0$ such that $\epsilon^2 = \delta/[2(b - a)]$ and by the Weierstrass theorem there is some polynomial $P_m(x)$ such that $|f(x) - P_m(x)| \le \epsilon$ in $a \le x \le b$. For this polynomial,

$$\int_a^b [f(x) - P_m(x)]^2\, dx \le \epsilon^2 (b - a) = \frac{\delta}{2}.$$

However, by (12a) $J_n$ is a non-increasing function of $n$; hence, the least squares approximation of degree $m$, say $Q_m(x)$, which has the smallest mean square error among all polynomials of degree at most $m$, satisfies

$$\frac{\delta}{2} \ge \int_a^b [f(x) - Q_m(x)]^2\, dx \ge \delta.$$

This is a contradiction unless $\delta = 0$. Of course, this mean convergence implies (12b), the Parseval equality. ■

Unfortunately, these simple results yield no information about the pointwise approximation of $f(x)$ by the least squares approximation $Q_n(x)$. In order to estimate $R_n(x) \equiv f(x) - \sum_{j=0}^{n} c_j P_j(x)$, with $c_j$ defined by (11) and $\{P_j(x)\}$ orthonormal, we write

$$R_n(x) = f(x) - \sum_{j=0}^{n} P_j(x) \int_a^b P_j(\xi) f(\xi)\, d\xi = f(x) - \int_a^b G_n(x, \xi) f(\xi)\, d\xi,$$

where

$$G_n(x, \xi) \equiv \sum_{j=0}^{n} P_j(x) P_j(\xi). \tag{13a}$$

From the orthogonality property, we observe that

$$\int_a^b G_n(x, \xi)\, d\xi = 1.$$

Therefore, we may rewrite $R_n(x)$ as

$$R_n(x) = \int_a^b G_n(x, \xi) [f(x) - f(\xi)]\, d\xi. \tag{13b}$$

Now, the rate at which $R_n(x) \to 0$, as $n \to \infty$, depends on the nature of the kernel, $G_n(x, \xi)$, and on the function $f(x)$.

A direct verification of convergence is possible if the sequence $Q_n(x) = \sum_{j=0}^{n} c_j P_j(x)$ converges in the mean to $f(x)$ and simultaneously converges uniformly in $[a, b]$. That is, if we define

$$g(x) \equiv \lim_{n \to \infty} Q_n(x),$$

the function $g(x)$ will be continuous in $[a, b]$, since it is the uniform limit of a sequence of continuous functions. On the other hand, because of the uniformity of convergence, we may pass to the limit under the integral sign, in the statement of mean convergence, to find

$$\int_a^b [f(x) - g(x)]^2\, dx = 0.$$

Therefore, $f(x) \equiv g(x)$, since they are both continuous.

Now, it is possible to show that the sequence $Q_n(x)$ converges uniformly if $f(x)$ has two continuous derivatives† in $[a, b]$. We carry out the details for the interval $[a, b] = [-1, 1]$, in Subsection 3.4. On the other hand, if the function $f(x)$ is merely continuous, the sequence $Q_n(x)$ need not converge.

3.1. Construction of Orthonormal Functions 

The method by which a set of orthonormal polynomials $\{P_n(x)\}$ can be determined is a special case of a general procedure in which an orthonormal set of functions is constructed from an arbitrary linearly independent set.‡ This process is known as the Gram-Schmidt orthonormalization method and is described as follows.

We begin by defining the inner product of any pair of real valued functions $f(x)$, $g(x)$ by

$$(f, g) = (g, f) \equiv \int_a^b f(x) g(x)\, dx. \tag{14}$$

Now, let $\{g_i(x)\}$, $i = 0, 1, \ldots, n$, be $n + 1$ linearly independent and square integrable functions over $[a, b]$. Consider the functions

$$\begin{aligned} f_0(x) &= d_0 [g_0(x)], \\ f_1(x) &= d_1 [g_1(x) - c_{01} f_0(x)], \\ &\;\;\vdots \\ f_n(x) &= d_n [g_n(x) - c_{0n} f_0(x) - \cdots - c_{n-1,n} f_{n-1}(x)]. \end{aligned} \tag{15}$$

t This requirement may be weakened. 

tf In analogy with the definition of linear independence for vectors, the set 
i = 0, 1,of functions are linearly independent in some interval [a, b ] if and 

n 

only if the only linear combination 2 atgiW that vanishes identically in [ a , b] has 

i =o 

a t = 0, / = 0, 1,. . n. 
We seek coefficients $d_k$, $c_{jk}$ such that the set $\{f_i(x)\}$ is orthonormal over
$[a, b]$; i.e., using the inner product notation:

$$(f_i, f_j) = \delta_{ij}, \quad i, j = 0, 1, \ldots, n.$$

To normalize $f_0(x)$ we need only require

$$(f_0, f_0) = d_0^2 (g_0, g_0) = 1.$$

Since $g_0(x) \not\equiv 0$, or the set $\{g_i(x)\}$ could not be linearly independent, we
may define

(16)$_0$  $$d_0 = \frac{1}{\sqrt{(g_0, g_0)}}.$$

In order that

$$\begin{aligned}
0 = (f_0, f_1) &= d_1 (f_0, g_1 - c_{01} f_0) \\
               &= d_1 [(f_0, g_1) - c_{01}],
\end{aligned}$$

we require

(16)$_{01}$  $$c_{01} = (f_0, g_1).$$

To normalize $f_1$ we set

$$(f_1, f_1) = d_1^2 (g_1 - c_{01} f_0,\; g_1 - c_{01} f_0) = 1.$$

The inner product on the right cannot vanish, by the assumed linear
independence of the $\{g_i(x)\}$, and thus $d_1$ is determined to within its sign.
As in (16)$_0$ we adopt the convention of using the positive square root.

In general, then, if $(f_i, f_j) = \delta_{ij}$ for all $i, j = 0, 1, \ldots, k - 1$, we find
$(f_j, f_k) = 0$ for all $j = 0, 1, \ldots, k - 1$, when we define $f_k$ as in (15) with

(16)$_{jk}$  $$c_{jk} = (f_j, g_k), \quad j \le k - 1.$$

The normalization constant, $d_k$, is easily obtained as before by setting

$$(f_k, f_k) = 1.$$
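
A minimal sketch of the procedure (15)-(16) in Python may make the bookkeeping concrete. It assumes NumPy and restricts the $g_k$ to polynomials so that the inner product (14) can be integrated exactly; the names `inner` and `gram_schmidt` are illustrative only.

```python
import numpy as np
from numpy.polynomial import Polynomial


def inner(f, g, a, b):
    """Inner product (14): integral of f*g over [a, b], exact for polynomials."""
    antiderivative = (f * g).integ()
    return float(antiderivative(b) - antiderivative(a))


def gram_schmidt(gs, a, b):
    """Orthonormalize the polynomials gs over [a, b] following (15)-(16)."""
    fs = []
    for g in gs:
        # Subtract the components c_jk = (f_j, g_k) along the earlier f_j.
        residual = g
        for f in fs:
            residual = residual - inner(f, g, a, b) * f
        # d_k uses the positive square root, as in (16)_0.
        d = 1.0 / float(np.sqrt(inner(residual, residual, a, b)))
        fs.append(d * residual)
    return fs


# Monomials 1, x, x^2, x^3 on [-1, 1].
monomials = [Polynomial.basis(j) for j in range(4)]
ortho = gram_schmidt(monomials, -1.0, 1.0)

# Matrix of inner products (f_i, f_j); it should be close to the identity.
gram = np.array([[inner(p, q, -1.0, 1.0) for q in ortho] for p in ortho])
```

The classical form used here mirrors (15) directly; for large $n$ a numerically more stable variant would usually be preferred.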

To apply the Gram-Schmidt procedure to the problem of obtaining
orthonormal polynomials $\{P_j(x)\}$ over an interval $[a, b]$, we observe that the
$n + 1$ polynomials

$$g_j(x) = x^j, \quad j = 0, 1, \ldots, n,$$

are linearly independent over any interval. The proof of their independence
follows from the fundamental theorem of algebra used in the proof of
Lemma 2.1. As in (15) we form

$$\begin{aligned}
P_0(x) &= d_0 (1), \\
P_1(x) &= d_1 [x - c_{01} P_0(x)], \\
       &\;\;\vdots \\
P_n(x) &= d_n [x^n - c_{0n} P_0(x) - \cdots - c_{n-1,n} P_{n-1}(x)].
\end{aligned}$$
Normalizing $P_0(x)$ requires

$$1 = \int_a^b P_0^2(x)\, dx = d_0^2 \int_a^b dx = d_0^2 (b - a),$$

so that $d_0 = (b - a)^{-1/2}$. By either repeating the previous derivation or
simply applying the formulae (16)$_{jk}$ we have

(17)$_{01}$  $$c_{01} = \int_a^b d_0\, x\, dx = d_0\, \frac{b^2 - a^2}{2} = \sqrt{b - a} \left( \frac{b + a}{2} \right).$$

Normalizing $P_1(x) = d_1 [x - c_{01} d_0]$ yields

$$\begin{aligned}
1 = \int_a^b P_1^2(x)\, dx &= d_1^2 \int_a^b \left[ x^2 - (b + a) x + \frac{(b + a)^2}{4} \right] dx \\
  &= d_1^2 \left[ \frac{b^3 - a^3}{3} - (b + a) \frac{b^2 - a^2}{2} + \frac{(b + a)^2}{4} (b - a) \right] \\
  &= \frac{d_1^2}{12} (b - a)^3,
\end{aligned}$$

or explicitly

$$d_1 = 2\sqrt{3}\, (b - a)^{-3/2}.$$

The first two polynomials are thus

$$P_0(x) = (b - a)^{-1/2}, \qquad P_1(x) = 2\sqrt{3}\, (b - a)^{-3/2} \left( x - \frac{b + a}{2} \right),$$

and any number of them can be obtained by continuing this procedure.
Let us denote by $P_n(x; a, b)$ the sequence of orthonormal polynomials
over $[a, b]$. Then it is easily verified that

$$P_n(x; a, b) = \left( \frac{2}{b - a} \right)^{1/2} P_n\!\left( \frac{2x - (a + b)}{b - a};\, -1, 1 \right)$$

is their representation in terms of the polynomials orthonormal over
$[-1, 1]$. In Problem 6, we verify that

$$P_n(x; -1, 1) = (n + \tfrac{1}{2})^{1/2}\, \frac{1}{2^n n!}\, \frac{d^n}{dx^n} (x^2 - 1)^n.$$

The polynomials

$$p_n(x) \equiv (n + \tfrac{1}{2})^{-1/2} P_n(x; -1, 1)$$

are called the Legendre polynomials.
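
The following sketch checks these formulas numerically. It assumes NumPy, builds $P_n(x; -1, 1)$ from the derivative formula verified in Problem 6, and assumes the usual linear change of variable $t = (2x - (a + b))/(b - a)$ with scale factor $(2/(b - a))^{1/2}$ to pass to $[a, b]$; the function names and the interval $[0.5, 3]$ are illustrative choices.

```python
import math
import numpy as np
from numpy.polynomial import Polynomial
from numpy.polynomial.legendre import leggauss


def legendre_orthonormal(n):
    """P_n(x; -1, 1) = sqrt(n + 1/2) / (2^n n!) * d^n/dx^n (x^2 - 1)^n."""
    u_power = Polynomial([-1.0, 0.0, 1.0]) ** n
    # deriv(0) returns the polynomial unchanged, which covers n = 0.
    return math.sqrt(n + 0.5) / (2.0**n * math.factorial(n)) * u_power.deriv(n)


def shifted(n, x, a, b):
    """Values of P_n(x; a, b) via the linear map of [a, b] onto [-1, 1]."""
    t = (2.0 * x - (a + b)) / (b - a)
    return math.sqrt(2.0 / (b - a)) * legendre_orthonormal(n)(t)


a, b = 0.5, 3.0
# Gauss-Legendre rule mapped to [a, b]; 12 nodes integrate degree <= 23 exactly.
nodes, weights = leggauss(12)
x = 0.5 * (b - a) * nodes + 0.5 * (b + a)
w = 0.5 * (b - a) * weights

values = np.array([shifted(n, x, a, b) for n in range(4)])
gram = (values * w) @ values.T   # entries approximate (P_r, P_s) over [a, b]
```

The matrix `gram` should be close to the identity, and `shifted(1, x, a, b)` should agree with the explicit $P_1(x)$ derived above.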

3.2. Weighted Least Squares Approximation 

The mean square measure of approximation defined in (2) gives equal
“weight” at each point in $[a, b]$ to the deviation of the approximating
polynomial from the function, $f(x)$. For some purposes, it may be
required that the approximation be better over some parts of the interval
$[a, b]$ than it is over other parts. This suggests a natural generalization of
(2) which is:

(18)  $$\| f(x) - Q_n(x) \|_{2,w} \equiv \left\{ \int_a^b [f(x) - Q_n(x)]^2\, w(x)\, dx \right\}^{1/2}$$

where $w(x) \ge 0$ in $[a, b]$ and

(19)  $$\int_a^b w(x)\, dx > 0.$$

The non-negative function $w(x)$ is called the weight function; clearly, if
$w(x) \equiv 1$, the usual least squares approximation results. For convenience,
we require that $w(x)$ be continuous in $(a, b)$ and have at most isolated zeros
in this interval. By choosing an appropriate function $w(x)$ and finding the
corresponding $Q_n(x)$ which minimizes (18), we may obtain an approximation
with good relative accuracy in a specified region of $[a, b]$. An
extreme example of this is illustrated by the fact that the interpolation
problem can be formulated as a special limiting case of (18) (see Problem 4).

If $Q_n(x)$ is assumed of the form (1) then, as before, we find a system of
equations for the determination of the coefficients $\{a_i\}$:

$$\sum_{i=0}^{n} a_i \left[ \int_a^b x^{i+k} w(x)\, dx \right] = \int_a^b x^k f(x) w(x)\, dx, \quad k = 0, 1, \ldots, n.$$

Again, the necessity for solving this system can be eliminated by
introducing a set of polynomials having appropriate properties. Specifically
we call a set of functions $\{P_j(x)\}$ orthonormal over $[a, b]$ with respect to
the weight $w(x)$ if

(20)  $$\int_a^b P_j(x) P_k(x) w(x)\, dx = \delta_{jk}.$$

Then to minimize (18) with a polynomial of the form (7), we find that

(21)  $$c_j = \int_a^b P_j(x) f(x) w(x)\, dx.$$
The construction of orthonormal functions with respect to a weight
$w(x)$ can be accomplished by the procedure of the previous subsection.
That is, we introduce, in place of (14), a new definition of inner product

(22)  $$(f, g) = (g, f) \equiv \int_a^b f(x) g(x) w(x)\, dx.$$

The generalizations of Bessel's inequality (12a), the mean convergence
proof for polynomial approximations of continuous functions and
Parseval's relation (12b) are valid in the present case with essentially no change
in argument. An important special example $w(x) = (1 - x^2)^{-1/2}$, $[a, b] = [-1, 1]$,
gives rise to the Chebyshev polynomials (see Problem 9). The
pointwise convergence of weighted mean square approximations is briefly
considered in Subsection 3.4. A proof of convergence for the Chebyshev
expansion of sufficiently smooth functions is given there.

3.3. Some Properties of Orthogonal Polynomials 

Let the polynomials $P_n(x)$, $n = 0, 1, 2, \ldots$, be orthogonal over $a \le x \le b$
with respect to the non-negative weight function $w(x)$. Then we have

THEOREM 4. The roots $x_j$, $j = 1, 2, \ldots, n$ of $P_n(x) = 0$, $n = 1, 2, \ldots$, are
all real and simple and lie in the open interval $a < x_j < b$.

Proof. Let those roots of $P_n(x) = 0$ in $(a, b)$ be $x_1, x_2, \ldots, x_r$ where any
multiple root is repeated the appropriate number of times. Then the
polynomial

$$Q_r(x) = (x - x_1)(x - x_2) \cdots (x - x_r)$$

has sign changes wherever $P_n(x)$ does in $(a, b)$ and it is of degree $r \le n$.
Thus, $P_n(x) Q_r(x)$ is of one sign in $(a, b)$ and so

$$\int_a^b P_n(x) Q_r(x) w(x)\, dx \ne 0.$$

This can only be true if $r = n$, since $P_n(x)$ is orthogonal to all polynomials
of lower degree. Now assume some root, say $x_1$, is multiple. Then we can
write

$$P_n(x) = (x - x_1)^2 p_{n-2}(x),$$

where $p_{n-2}(x)$ is of degree $n - 2$. But

$$P_n(x) p_{n-2}(x) = \left( \frac{P_n(x)}{x - x_1} \right)^2 \ge 0$$

and hence

$$\int_a^b P_n(x) p_{n-2}(x) w(x)\, dx > 0.$$

But this is a contradiction since $P_n(x)$ is orthogonal to any lower order
polynomial. Hence multiple roots cannot occur. ■
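
Theorem 4 is easy to observe numerically in the Legendre case ($w(x) \equiv 1$ on $[-1, 1]$). The sketch below assumes NumPy; the helper name and the tolerance are illustrative choices, and the check is of course no substitute for the proof.

```python
import numpy as np
from numpy.polynomial import Legendre


def roots_are_real_simple_interior(n, tol=1e-8):
    """Check the three claims of Theorem 4 for the degree-n Legendre polynomial."""
    roots = Legendre.basis(n).roots()
    real = np.all(np.abs(np.imag(roots)) < tol)
    xs = np.sort(np.real(roots))
    interior = np.all((xs > -1.0) & (xs < 1.0))   # open interval (-1, 1)
    simple = np.all(np.diff(xs) > tol)            # no repeated roots
    return bool(real and interior and simple)


checks = [roots_are_real_simple_interior(n) for n in range(1, 12)]
```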

The orthonormal polynomials satisfy a simple recursion formula which
is stated in

THEOREM 5. Any three consecutive orthonormal polynomials are related
by

(23)  $$P_{n+1}(x) = (A_n x + B_n) P_n(x) - C_n P_{n-1}(x).$$

If $a_k$ and $b_k$ represent the coefficients of the terms of degree $k$ and $k - 1$ in
$P_k(x)$ then

(24)  $$A_n = \frac{a_{n+1}}{a_n}, \qquad B_n = \frac{a_{n+1}}{a_n} \left( \frac{b_{n+1}}{a_{n+1}} - \frac{b_n}{a_n} \right), \qquad C_n = \frac{a_{n+1} a_{n-1}}{a_n^2}.$$

Proof. With $A_n$ given by (24) it follows that

$$P_{n+1}(x) - A_n x P_n(x) \equiv Q_n(x)$$

is a polynomial of degree at most $n$. Hence, $Q_n(x)$ can be expanded as

$$Q_n(x) = \alpha_n P_n(x) + \cdots + \alpha_0 P_0(x).$$

By the orthogonality, however, we find that

$$\begin{aligned}
\alpha_k &= \int_a^b Q_n(x) P_k(x) w(x)\, dx \\
         &= \int_a^b P_{n+1}(x) P_k(x) w(x)\, dx - A_n \int_a^b P_n(x) P_k(x)\, x\, w(x)\, dx \\
         &= 0, \quad \text{for } k = 0, 1, \ldots, n - 2.
\end{aligned}$$

Thus, the form in (23) follows upon setting $\alpha_n = B_n$ and $\alpha_{n-1} = -C_n$.
Now we may write

$$x P_{n-1}(x) = \frac{a_{n-1}}{a_n} P_n(x) + q_{n-1}(x),$$

where $q_{n-1}(x)$ is of degree at most $n - 1$. Then it follows that

$$\begin{aligned}
C_n &= A_n \int_a^b P_n(x) P_{n-1}(x)\, x\, w(x)\, dx \\
    &= A_n \frac{a_{n-1}}{a_n} \int_a^b P_n^2(x) w(x)\, dx + A_n \int_a^b P_n(x) q_{n-1}(x) w(x)\, dx \\
    &= A_n \frac{a_{n-1}}{a_n},
\end{aligned}$$

since the first integral is unity and the second vanishes by orthogonality.
The coefficient $B_n$ is easily obtained by equating coefficients of the terms
of degree $n$ in (23) and the proof is completed. We observe that (23) and
(24) are valid for $n = 0$ if we define $a_{-1} = P_{-1}(x) \equiv 0$. ■
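
As a hedged numerical illustration of Theorem 5, the sketch below forms $A_n$, $B_n$, $C_n$ from (24) for the orthonormal Legendre polynomials and measures how well (23) is satisfied. It assumes NumPy; the helper names `P` and `recurrence` are illustrative.

```python
import math
import numpy as np
from numpy.polynomial import Legendre, Polynomial


def P(n):
    """Orthonormal Legendre polynomial P_n(x; -1, 1) in the power basis."""
    return math.sqrt(n + 0.5) * Legendre.basis(n).convert(kind=Polynomial)


def recurrence(n):
    """Return A_n, B_n, C_n from (24) and the largest residual coefficient of (23)."""
    p_prev, p_n, p_next = P(n - 1), P(n), P(n + 1)
    lead = lambda p: float(p.coef[-1])                              # a_k
    sub = lambda p: float(p.coef[-2]) if len(p.coef) > 1 else 0.0   # b_k
    A = lead(p_next) / lead(p_n)
    B = A * (sub(p_next) / lead(p_next) - sub(p_n) / lead(p_n))
    C = lead(p_next) * lead(p_prev) / lead(p_n) ** 2
    # Residual of (23): P_{n+1} - [(A x + B) P_n - C P_{n-1}].
    residual = p_next - (Polynomial([B, A]) * p_n - C * p_prev)
    return A, B, C, float(np.max(np.abs(residual.coef)))


results = [recurrence(n) for n in range(1, 7)]
```

For this symmetric weight the computed $B_n$ vanish, as expected since each $P_n$ is either even or odd.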

The result in Theorem 5 can be used to derive what is known as the
Christoffel-Darboux relation. We state this as

THEOREM 6. The orthonormal polynomials satisfy

(25)  $$\frac{a_n}{a_{n+1}} \left[ P_{n+1}(x) P_n(\xi) - P_{n+1}(\xi) P_n(x) \right] = (x - \xi) \sum_{j=0}^{n} P_j(x) P_j(\xi).$$

Proof. Multiply the recursion formula (23) by $P_n(\xi)$ to get:

$$P_n(\xi) P_{n+1}(x) = (A_n x + B_n) P_n(\xi) P_n(x) - C_n P_n(\xi) P_{n-1}(x).$$

Since this is an identity, it holds if we interchange the arguments $x$ and $\xi$.
Subtracting this interchanged form from the original form and
multiplying by $A_n^{-1}$ yields, with the aid of (24),

$$\begin{aligned}
(x - \xi) P_n(x) P_n(\xi) = {} & A_n^{-1} \left[ P_{n+1}(x) P_n(\xi) - P_{n+1}(\xi) P_n(x) \right] \\
  & - A_{n-1}^{-1} \left[ P_n(x) P_{n-1}(\xi) - P_n(\xi) P_{n-1}(x) \right].
\end{aligned}$$

We now sum these identities over $0, 1, \ldots, n$ and the theorem follows
(for $n = 0$, we use the convention $a_{-1} = 0$) since $A_n^{-1} = a_n / a_{n+1}$. ■

Theorem 6 gives a convenient representation of the kernel $G_n(x, \xi)$
defined in (13a).
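
The identity (25) can be spot-checked numerically for the orthonormal Legendre polynomials. The sketch below assumes NumPy and uses the known leading coefficient $(2n)!/(2^n (n!)^2)$ of the Legendre polynomial $p_n$; the function names and the sample points are illustrative.

```python
import math
import numpy as np
from numpy.polynomial import legendre


def P(j, x):
    """Orthonormal Legendre polynomial P_j evaluated at x."""
    coeffs = np.zeros(j + 1)
    coeffs[j] = 1.0
    return math.sqrt(j + 0.5) * legendre.legval(x, coeffs)


def leading(j):
    """Leading coefficient a_j of P_j."""
    return math.sqrt(j + 0.5) * math.comb(2 * j, j) / 2.0**j


def kernel_sum(n, x, xi):
    """G_n(x, xi) as the sum (13a)."""
    return sum(P(j, x) * P(j, xi) for j in range(n + 1))


def kernel_closed(n, x, xi):
    """G_n(x, xi) from the Christoffel-Darboux relation (25), for x != xi."""
    ratio = leading(n) / leading(n + 1)
    return ratio * (P(n + 1, x) * P(n, xi) - P(n + 1, xi) * P(n, x)) / (x - xi)


direct = kernel_sum(5, 0.3, -0.7)
closed = kernel_closed(5, 0.3, -0.7)
```

The closed form needs only two polynomial evaluations per argument, whatever the value of $n$, which is what makes it convenient for studying the kernel.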

3.4. Pointwise Convergence of Least Squares Approximations 

We first consider the ordinary least squares approximation over $[-1, 1]$
in which case the orthonormal polynomials [essentially the Legendre
polynomials, see (29)] can be defined as

(26)  $$P_n(x; -1, 1) \equiv P_n(x) = (n + \tfrac{1}{2})^{1/2}\, \frac{1}{2^n n!}\, \frac{d^n}{dx^n} (x^2 - 1)^n, \quad n = 0, 1, 2, \ldots.$$

The derivation of this representation is contained in Problem 6. Given
$f(x)$, we find the least squares polynomial approximation of degree at
most $n$ to be as in (7) and (11)

(27a)  $$Q_n(x) = \sum_{j=0}^{n} c_j P_j(x);$$

(27b)  $$c_j = \int_{-1}^{1} f(x) P_j(x)\, dx.$$
If $f(x)$ is continuous on $[-1, 1]$, it follows from Theorem 3 that

(28a)  $$\lim_{n \to \infty} \int_{-1}^{1} [f(x) - Q_n(x)]^2\, dx = 0,$$

(28b)  $$\lim_{n \to \infty} c_n = 0.$$

If $f(x)$ satisfies additional smoothness conditions, we can deduce uniform
convergence of $Q_n(x)$ to $f(x)$. In fact, we have

THEOREM 7. Let $Q_n(x)$ be defined by (27) and let $f(x)$ have a continuous
second derivative on $[-1, 1]$. Then for all $x \in [-1, 1]$ and any $\epsilon > 0$

$$|f(x) - Q_n(x)| < \frac{\epsilon}{\sqrt{n}}$$

provided $n$ is sufficiently large.

Proof. We introduce the Legendre polynomials

(29)  $$p_n(x) \equiv (n + \tfrac{1}{2})^{-1/2} P_n(x) = \frac{1}{2^n n!}\, \frac{d^n}{dx^n} (x^2 - 1)^n; \quad n = 0, 1, 2, \ldots$$

some of whose properties are described in Problems 6-8. If we set
$u \equiv x^2 - 1$ in (29), it easily follows that

$$\begin{aligned}
2^{n+1} (n + 1)!\, p_{n+1}'(x) &= \frac{d^{n+2}}{dx^{n+2}} u^{n+1} \\
  &= \frac{d^n}{dx^n} \left[ 2(n + 1) u^{n-1} (u + 2n x^2) \right] \\
  &= 2(n + 1) \frac{d^n}{dx^n} \left[ (2n + 1) u^n + 2n u^{n-1} \right] \\
  &= 2^{n+1} (n + 1)! \left[ (2n + 1) p_n(x) + p_{n-1}'(x) \right].
\end{aligned}$$

Thus we have deduced the relation

(30)  $$p_{n+1}'(x) = (2n + 1) p_n(x) + p_{n-1}'(x), \quad n = 1, 2, \ldots;$$

which by (29) can be rewritten for the $P_n(x)$ as:

(31)  $$(n + \tfrac{3}{2})^{-1/2} P_{n+1}'(x) - (n - \tfrac{1}{2})^{-1/2} P_{n-1}'(x) = (2n + 1)(n + \tfrac{1}{2})^{-1/2} P_n(x); \quad n = 1, 2, \ldots.$$

Now we introduce the notation

(32)  $$c_k' \equiv \int_{-1}^{1} f'(x) P_k(x)\, dx, \qquad c_k'' \equiv \int_{-1}^{1} f''(x) P_k(x)\, dx; \quad k = 0, 1, \ldots,$$
and use integration by parts to deduce

(33)  $$\begin{aligned}
c_{k \pm 1}' &= \int_{-1}^{1} f'(x) P_{k \pm 1}(x)\, dx \\
             &= \Big[ f(x) P_{k \pm 1}(x) \Big]_{-1}^{1} - \int_{-1}^{1} f(x) P_{k \pm 1}'(x)\, dx.
\end{aligned}$$

Let us assume for the present that $f(\pm 1) = f'(\pm 1) = 0$. Then (33)
simplifies and with (31) and (27b) yields

$$\begin{aligned}
(k - \tfrac{1}{2})^{-1/2} c_{k-1}' - (k + \tfrac{3}{2})^{-1/2} c_{k+1}'
  &= (2k + 1)(k + \tfrac{1}{2})^{-1/2} \int_{-1}^{1} f(x) P_k(x)\, dx \\
  &= (2k + 1)(k + \tfrac{1}{2})^{-1/2} c_k.
\end{aligned}$$

From this, it follows that

(34)  $$k c_k = B_k c_{k-1}' - A_k c_{k+1}', \qquad A_k \equiv \left( \frac{2k + 1}{2k + 3} \right)^{1/2} \frac{k}{2k + 1}, \qquad B_k \equiv \left( \frac{2k + 1}{2k - 1} \right)^{1/2} \frac{k}{2k + 1}.$$

But since $f'(x)$ is continuous we may use (28) for $f'(x)$ and $c_n'$ to conclude
from (34) that, since $c_k' \to 0$, $A_k \to \tfrac{1}{2}$ and $B_k \to \tfrac{1}{2}$,

(35)  $$\lim_{k \to \infty} k c_k = 0.$$

This argument can be repeated with the function $w(x) \equiv f'(x)$. By the
hypothesis, it follows that $w'(x) = f''(x)$ is continuous and in place of
(35) we get (having assumed that $w(\pm 1) = f'(\pm 1) = 0$)

$$\lim_{k \to \infty} k c_k' = 0.$$

However, by using this result in (34), we find

(36)  $$\lim_{k \to \infty} k^2 c_k = 0.$$

From the property

$$|p_n(x)| \le 1$$

exhibited in Problem 8 we deduce from (29) that for $x \in [-1, 1]$

(37)  $$|P_n(x)| \le \sqrt{n + \tfrac{1}{2}}.$$

By (36) we can pick any $\epsilon > 0$ and find $n$ sufficiently large so that $k^2 |c_k| < \epsilon$
for all $k > n$. Then from (27) and (37) we have for any $m > n$ and $x \in [-1, 1]$

$$|Q_m(x) - Q_n(x)| = \left| \sum_{k=n+1}^{m} \frac{1}{k^2} (k^2 c_k) P_k(x) \right| < \epsilon \sum_{k=n+1}^{m} \frac{1}{k^2} |P_k(x)|,$$

so that

(38)  $$\begin{aligned}
|Q_m(x) - Q_n(x)| &< \epsilon \sum_{k=n+1}^{m} \frac{\sqrt{k + \tfrac{1}{2}}}{k^2}
    = \epsilon \sum_{k=n+1}^{m} \frac{1}{k^{3/2}} \sqrt{1 + \frac{1}{2k}} \\
  &\le \sqrt{2}\, \epsilon \sum_{k=n+1}^{\infty} \frac{1}{k^{3/2}}
    \le \frac{2\sqrt{2}\, \epsilon}{\sqrt{n}}.
\end{aligned}$$

Here we have used $k^{-3/2} < \int_{k-1}^{k} x^{-3/2}\, dx$. It follows from (38) that the least
squares polynomials $\{Q_n(x)\}$ form a Cauchy sequence and converge
uniformly on $[-1, 1]$.

If we call

$$\lim_{n \to \infty} Q_n(x) = g(x),$$

then $g(x)$ is continuous on $[-1, 1]$ since it is the uniform limit of
continuous functions. Again, by the uniformity we may take the limit under the
integral sign in (28a) to find, since $f(x)$ and $g(x)$ are continuous, that
$f(x) \equiv g(x)$. Finally, letting $m \to \infty$ in (38) and replacing $\epsilon$ by $\epsilon / (2\sqrt{2})$ we
get the result stated in the theorem.

To complete the proof we must eliminate the requirement that
$f(\pm 1) = f'(\pm 1) = 0$. To do this, we construct, for any $f(x)$, the Hermite
interpolation polynomial, $h_3(x)$, for which

$$h_3(\pm 1) = f(\pm 1), \qquad h_3'(\pm 1) = f'(\pm 1).$$

Then $g(x) \equiv f(x) - h_3(x)$ satisfies all the requirements of the theorem.
However, since $h_3(x)$ has degree at most 3, it follows from (27b) and the
orthogonality of the polynomials $P_n(x)$, that the $c_j$ are unchanged for
$j \ge 4$ if $f(x)$ is replaced by $g(x)$. ■
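
A small experiment illustrates the uniform convergence asserted in Theorem 7. The sketch below assumes NumPy, approximates the integrals (27b) by a Gauss-Legendre rule (so the coefficients are themselves only approximate), and uses the illustrative test function $f(x) = |x|^3$, whose second derivative $6|x|$ is continuous on $[-1, 1]$.

```python
import numpy as np
from numpy.polynomial import legendre


def legendre_series(f, n, quad_points=400):
    """Coefficients of Q_n in the Legendre basis p_0, ..., p_n.

    The k-th entry is (k + 1/2) * integral(f * p_k), which equals
    c_k * sqrt(k + 1/2) with c_k as in (27b); the integral is approximated
    by Gauss-Legendre quadrature.
    """
    nodes, weights = legendre.leggauss(quad_points)
    coeffs = np.empty(n + 1)
    for k in range(n + 1):
        basis = np.zeros(k + 1)
        basis[k] = 1.0
        coeffs[k] = (k + 0.5) * np.sum(weights * f(nodes) * legendre.legval(nodes, basis))
    return coeffs


def f(x):
    return np.abs(x) ** 3


grid = np.linspace(-1.0, 1.0, 2001)


def max_error(n):
    """Largest |f - Q_n| on a fine grid, a proxy for the uniform error."""
    return float(np.max(np.abs(f(grid) - legendre.legval(grid, legendre_series(f, n)))))


errors = {n: max_error(n) for n in (2, 4, 16)}
```

The grid maximum is only a proxy for the true maximum error, but it should decrease as $n$ grows.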


By using the technique of the above proof, we find 
THEOREM 8. Let $f(x)$ have a continuous $r$th derivative on $[-1, 1]$ where
$r \ge 2$. Then with $Q_n(x)$ and $c_n$ defined in (27): $\lim_{k \to \infty} k^r c_k = 0$ and

$$|f(x) - Q_n(x)| = O(n^{-r + 3/2}), \quad \text{for all } x \in [-1, 1].$$

Proof. See Problem 12. ■

A simple linear change of variable yields the

COROLLARY. If $f(x)$ has $r \ge 2$ continuous derivatives in $[a, b]$ then there
exists a polynomial approximation, $q_n(x)$, of degree at most $n$ such that

$$|f(x) - q_n(x)| = O(n^{-r + 3/2}), \quad \text{for all } x \in [a, b].$$

Proof. See Problem 13. ■

Analogous results can be obtained for various weighted least squares
approximations. If the weight function is $w(x)$ and the interval is $[a, b]$,
then the approximation to $f(x)$ is

$$Q_n(x) = \sum_{j=0}^{n} c_j P_j(x)$$

where the $c_j$ are defined in (21) and the orthonormal polynomials $P_n(x)$
satisfy (20). A proof of the pointwise convergence of $Q_n(x)$ to a sufficiently
smooth $f(x)$ can be given if

(i) the $P_n(x)$ are the eigenfunctions of a regular second order differential
operator, say

$$\mathcal{L}[P_n(x)] \equiv a(x) P_n''(x) + b(x) P_n'(x) = \lambda_n P_n(x),$$

whose eigenvalues, $\lambda_n$, satisfy

$$\lim_{n \to \infty} \lambda_n n^{-2} = \text{const.};$$

(ii) the $P_n(x)$ are bounded by

$$|P_n(x)| = O(n^{1/2}) \quad \text{for all } x \in [a, b].$$

In particular, we shall sketch the proof for the case in which the $P_n(x)$
are related to the Chebyshev polynomials. These polynomials are orthogonal
over $[-1, 1]$ with respect to the weight

(39)  $$w(x) = \frac{1}{\sqrt{1 - x^2}}.$$

In Problem 9 they are determined as (see Section 5):

(40)  $$P_0(x) = \frac{1}{\sqrt{\pi}}, \qquad P_n(x) = \sqrt{\frac{2}{\pi}}\, \cos (n \cos^{-1} x), \quad n = 1, 2, \ldots$$
and they are solutions of

(41)  $$(1 - x^2) P_n''(x) - x P_n'(x) = -n^2 P_n(x).$$

Thus in this case $\lambda_n = -n^2$. We require $f(x)$ to have a continuous second
derivative on $[-1, 1]$. By the remarks at the end of the proof of Theorem 7,
we may assume without loss of generality that $f(\pm 1) = f'(\pm 1) = 0$ in
the present proof. [Since $(P_i(x), P_j(x)) = \delta_{ij}$, the $c_n$ of (21) are unchanged
for $n \ge 4$ if $f(x)$ is replaced by $f(x) - h_3(x)$, see last paragraph of the proof
of Theorem 7.]

From (39) and (41) in (21) with $[a, b] = [-1, 1]$ we have

$$c_n = -\frac{1}{n^2} \int_{-1}^{1} \frac{f(x)}{\sqrt{1 - x^2}} \left[ (1 - x^2) P_n''(x) - x P_n'(x) \right] dx.$$

Integrating by parts, all derivatives can be removed from $P_n(x)$ to get

(42a)  $$c_n = -\frac{1}{n^2} \int_{-1}^{1} \left[ \alpha(x) f(x) + \beta(x) f'(x) + \gamma(x) f''(x) \right] P_n(x) w(x)\, dx,$$

where

(42b) a(x) m - 5 * m - -3x, y(x) = \ - x 2 . 

Since $f(\pm 1) = f'(\pm 1) = 0$ and $f''(x)$ is continuous, we note that $\alpha(x) f(x)$,
$\beta(x) f'(x)$ and $\gamma(x) f''(x)$ are continuous on $[-1, 1]$. Thus the coefficients
in the expansion of the sum $[\alpha f + \beta f' + \gamma f'']$ tend to zero as $n \to \infty$ (by
the analog of Theorem 3 for weighted polynomials). Using this fact in
(42a) implies

(43)  $$\lim_{n \to \infty} n^2 c_n = 0.$$

A sharper bound than that in (ii) is easily obtained for the Chebyshev
polynomials. Clearly, from the representation (40),

(44)  $$|P_n(x)| \le \sqrt{\frac{2}{\pi}}.$$

From (43) and (44) we find, as in the proof of Theorem 7, that the $Q_n(x)$
converge uniformly. However, by using this fact and the mean convergence
we easily find that

(45)  $$|f(x) - Q_n(x)| = O\!\left( \frac{1}{n} \right) \quad \text{for all } x \in [-1, 1].$$

Note that the error estimate here is smaller by $O(1/\sqrt{n})$ than that for the
Legendre polynomial expansion deduced in Theorem 7.

This argument is easily extended to give 
THEOREM 9. Let $f(x)$ have a continuous $r$th derivative on $[-1, 1]$ where
$r \ge 2$. Then the mean square Chebyshev approximations, $Q_n(x)$, with
coefficients defined by (39) and (40) in (21) satisfy

$$|f(x) - Q_n(x)| = O(n^{1 - r}), \quad \text{for all } x \in [-1, 1]. \quad ■$$


3.5. Discrete Least Squares Approximation 

For a fixed set $(x_0, x_1, \ldots, x_M)$ of distinct points we might seek to
minimize

(46)  $$|f(x) - Q_n(x)|_D^2 \equiv \sum_{i=0}^{M} [f(x_i) - Q_n(x_i)]^2$$

over all polynomials of degree at most $n$. Here $M$ is usually much larger
than the degree $n$ of the class of approximating polynomials. The fact that
$|\cdot|_D$ is a norm and essentially strict, is easily shown in Problems 15 and 16.
It then follows as in Theorem 0.1, that a minimizing polynomial exists.
Further, the minimizing $Q_n(x)$ is uniquely determined if $n \le M$ (see
Problem 17).

To actually determine the discrete least squares approximation, we
again use the notation

(47)  $$\phi(a_0, \ldots, a_n) \equiv |f - Q_n(x)|_D^2,$$

where

$$Q_n(x) = a_0 + a_1 x + \cdots + a_n x^n.$$

This function $\phi$ is quadratic in the $a_i$ since

(48)  $$\phi(a_0, \ldots, a_n) = \sum_{i=0}^{M} f^2(x_i) - 2 \sum_{k=0}^{n} a_k \sum_{i=0}^{M} x_i^k f(x_i) + \sum_{k=0}^{n} \sum_{j=0}^{n} a_k a_j \sum_{i=0}^{M} x_i^{k+j}.$$

At a minimum, $\{a_j\}$, of $\phi(a)$ we have the necessary conditions

$$\frac{\partial \phi}{\partial a_k} = 0, \quad k = 0, 1, \ldots, n,$$

which yield the normal system

(49)  $$\sum_{j=0}^{n} a_j \sum_{i=0}^{M} x_i^{k+j} = \sum_{i=0}^{M} f(x_i) x_i^k, \quad k = 0, 1, \ldots, n.$$
If $n \le M$, the right-hand side of (49) may take on any preassigned
values $(c_0, c_1, \ldots, c_n)$ by suitably picking $[f(x_0), f(x_1), \ldots, f(x_M)]$ (e.g., if
$M = n$, the Vandermonde determinant $|x_i^k|$ is non-singular and if $M > n$,
there are fewer restrictions than variables $f(x_i)$ to determine). Hence, the
system (49) is solvable for any right-hand side and therefore non-singular,
if $M \ge n$.

But, we may avoid the necessity of having to solve the general normal
system (49) if, in analogy with (10), we can construct a sequence of
polynomials $P_n(x)$, $n = 0, 1, \ldots, M$, which is orthonormal relative to
summation over $x_i$, $i = 0, 1, \ldots, M$. To this end, we define an inner product by

(50)  $$(f, g) = (g, f) \equiv \sum_{i=0}^{M} f(x_i) g(x_i).$$

With this inner product in the Gram-Schmidt process of Subsection 3.1,
we can orthonormalize the independent set $\{x^k\}$ for $k = 0, 1, 2, \ldots, M$.
The result is a set of polynomials $\{P_k(x)\}$ for which

(51)  $$(P_r(x), P_s(x)) = \sum_{i=0}^{M} P_r(x_i) P_s(x_i) = \delta_{rs}; \quad r, s = 0, 1, \ldots, M.$$

Now the unique polynomial $Q_n(x)$ of degree at most $n \le M$ that minimizes
(46) can be written as (see Problem 14)

(52a)  $$Q_n(x) = \sum_{k=0}^{n} d_k P_k(x),$$

where

(52b)  $$d_k = \sum_{i=0}^{M} f(x_i) P_k(x_i).$$
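
The two routes to the discrete least squares polynomial can be compared in a short sketch, assuming NumPy. The first function solves the normal system (49) directly. The second uses a QR factorization of the matrix with entries $x_i^k$ as a stand-in for Gram-Schmidt with the inner product (50): the columns of `Q` hold the values $P_k(x_i)$ of discretely orthonormal polynomials, up to sign. The function names and the sample data are illustrative.

```python
import numpy as np


def fit_normal_equations(x, fx, n):
    """Monomial coefficients a_0..a_n from the normal system (49)."""
    V = np.vander(x, n + 1, increasing=True)   # V[i, k] = x_i ** k
    return np.linalg.solve(V.T @ V, V.T @ fx)


def fit_orthonormal(x, fx, n):
    """Same coefficients via discretely orthonormal polynomials, (52a)-(52b)."""
    V = np.vander(x, n + 1, increasing=True)
    Q, R = np.linalg.qr(V)        # Q[:, k] = P_k(x_i), orthonormal under (50)
    d = Q.T @ fx                  # d_k = sum_i f(x_i) P_k(x_i), as in (52b)
    return np.linalg.solve(R, d)  # convert back to monomial coefficients


x = np.linspace(-1.0, 1.0, 21)    # M + 1 = 21 points
fx = np.exp(x)
a_normal = fit_normal_equations(x, fx, 3)
a_ortho = fit_orthonormal(x, fx, 3)
```

For small $n$ the two answers agree to rounding error; as $n$ grows the normal system becomes increasingly ill-conditioned, which is one practical reason for preferring an orthonormal basis.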

We now consider the determination of polynomials which satisfy (51)
over various sets of points. As indicated, the Gram-Schmidt procedure
could be used, but for the special cases to be treated it is not required.
First, we consider uniformly spaced points $\{x_j\}$ in $[-1, 1]$; say,

$$x_0 = -1, \quad x_j = x_0 + jh, \quad h = \frac{2}{M}, \quad j = 0, 1, \ldots, M.$$

The corresponding orthonormal polynomials that satisfy (51) can be
written as

(53)  $$P_n(x) = C_n \Delta^n \left( u^{[n]}(x)\, v^{[n]}(x) \right), \quad n = 0, 1, \ldots, M.$$
Here the forward difference operators $\Delta^n$ are defined by:

(54)  $$\begin{aligned}
\Delta^0 f(x) &\equiv f(x), \\
\Delta f(x) &\equiv f(x + h) - f(x), \\
\Delta^2 f(x) &\equiv \Delta(\Delta f(x)) = f(x + 2h) - 2 f(x + h) + f(x), \\
\Delta^n f(x) &\equiv \Delta[\Delta^{n-1} f(x)] = \sum_{j=0}^{n} (-1)^j \binom{n}{j} f(x + (n - j) h);
\end{aligned}$$

the coefficients $C_n$ are constants given in (58) and

(55)  $$\begin{aligned}
u^{[0]}(x) &= v^{[0]}(x) \equiv 1, \\
u^{[n]}(x) &\equiv (x - x_0)(x - x_1) \cdots (x - x_{n-1}), \\
v^{[n]}(x) &\equiv (x - x_{M+1})(x - x_{M+2}) \cdots (x - x_{M+n}).
\end{aligned}$$

Formula (53) is the discrete analog of the formula (26) for the Legendre
polynomials.
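
Formulas (53)-(55) translate almost literally into code. The sketch below assumes NumPy, evaluates $\Delta^n (u^{[n]} v^{[n]})$ at the nodes by the binomial sum in (54), and then normalizes so that the sum of squares over the nodes is one; the function name `gram_values` and the choice $M = 10$ are illustrative.

```python
import math
import numpy as np


def gram_values(n, M):
    """Values P_n(x_0), ..., P_n(x_M) of (53) on M + 1 equally spaced points."""
    h = 2.0 / M

    def node(j):
        return -1.0 + j * h   # x_j; also used for the fictitious nodes j > M

    def uv(x):
        """u^[n](x) * v^[n](x) from (55)."""
        if n == 0:
            return np.ones_like(x)
        u = np.prod([x - node(j) for j in range(n)], axis=0)
        v = np.prod([x - node(M + j) for j in range(1, n + 1)], axis=0)
        return u * v

    x = np.array([node(i) for i in range(M + 1)])
    # (54): Delta^n f(x) = sum_j (-1)^j C(n, j) f(x + (n - j) h)
    values = sum((-1) ** j * math.comb(n, j) * uv(x + (n - j) * h) for j in range(n + 1))
    # Normalize so that the sum of squares over the nodes equals one.
    return values / np.sqrt(np.sum(values**2))


table = np.array([gram_values(n, 10) for n in range(5)])
gram = table @ table.T   # entries are the discrete inner products (51)
```

The matrix `gram` should be close to the identity, which is the orthonormality (51) for the first five polynomials.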

The fact that these polynomials satisfy (51) may be verified by the use
of a formula for summation by parts, which is analogous to integration by
parts. To derive this formula, we note the identity

(56a)  $$\Delta(FG) = F \Delta G + G \Delta F + (\Delta F)(\Delta G)$$

which can be written as

(56b)  $$G \Delta F = \Delta(FG) - F \Delta G - (\Delta F)(\Delta G).$$

Now assume $r > s$ and let

$$F(x) \equiv \Delta^{r-1} \left( u^{[r]}(x)\, v^{[r]}(x) \right), \qquad G(x) \equiv \Delta^s \left( u^{[s]}(x)\, v^{[s]}(x) \right).$$

We evaluate (56b) at each point $x_i = x_0 + ih$ and sum over $0 \le i \le M$
to get

(57)  $$\begin{aligned}
\frac{1}{C_s C_r} \sum_{i=0}^{M} P_s(x_i) P_r(x_i) &= \sum_{i=0}^{M} G(x_i) \Delta F(x_i) \\
  &= \sum_{i=0}^{M} \Delta[F(x_i) G(x_i)] - \sum_{i=0}^{M} F(x_i) \Delta G(x_i) - \sum_{i=0}^{M} [\Delta F(x_i)][\Delta G(x_i)] \\
  &= F(x) G(x) \Big|_{x_0}^{x_{M+1}} - \sum_{i=0}^{M} F(x_i) \Delta G(x_i) - \sum_{i=0}^{M} [\Delta F(x_i)][\Delta G(x_i)].
\end{aligned}$$
We observe that $F(x_0) = F(x_{M+1}) = 0$, whence from (53) and (57) we
have converted the sum in (51) into two sums in which $\Delta G$ appears.

We may continue with this process of summation by parts to successively
form sums in which higher order differences of $G$ appear. The boundary
terms vanish since, by Problem 21 and the identity (56a), for $x = x_0$ or
$x = x_{M+1}$

$$\Delta^k \left( u^{[r]}(x)\, v^{[r]}(x) \right) = 0, \quad k = 0, 1, \ldots, r - 1.$$

Now, from the assumption $r > s$, it follows that after $s + 1$ such
applications, the term $\Delta^{2s+1} (u^{[s]} v^{[s]})$ will be a factor in all the resulting sums.
But $u^{[s]}(x)\, v^{[s]}(x) \equiv q_{2s}(x)$ is a polynomial of degree $2s$ and hence all the
sums vanish identically since for any polynomial $p_n(x)$ of degree at most $n$,
the difference operator reduces the degree, i.e.,

$$\Delta p_n(x) = p_{n-1}(x),$$

$$\Delta^m p_n(x) \equiv 0, \quad \text{for } m > n.$$

The verification that (51) is valid for $r = s$ follows when we define

(58)  $$C_n = \left[ \sum_{i=0}^{M} \left\{ \Delta^n \left( u^{[n]}(x_i)\, v^{[n]}(x_i) \right) \right\}^2 \right]^{-1/2}.$$

The bracket in (58) is not zero if we can show that

$$\Delta^n \left[ u^{[n]}(x_0)\, v^{[n]}(x_0) \right] \ne 0.$$

But the polynomial

$$p_{2n}(x) \equiv u^{[n]}(x)\, v^{[n]}(x)$$

has only the $2n$ zeros $(x_0, x_1, \ldots, x_{n-1}, x_{M+1}, x_{M+2}, \ldots, x_{M+n})$ and
$M \ge n$. Then from (54) with $f(x) \equiv p_{2n}(x)$, only one term in the expression
for $\Delta^n f(x_0)$ is non-zero, i.e.,

$$\Delta^n p_{2n}(x_0) = p_{2n}(x_n) \ne 0.$$

Hence the definition (58) is valid.

The polynomials $P_n(x)$ of (53) have been called the Gram polynomials.
It can be shown that the polynomials $P_n(x) / \sqrt{2/M}$ converge as $M \to \infty$
to the orthonormal polynomials defined in equation (26) (they are related
to the Legendre polynomials).

Another interesting set of points for discrete least squares approximation
in $[-1, 1]$ are the zeros, $(x_0, x_1, \ldots, x_M)$, of the $(M + 1)$-st Chebyshev
polynomial

(59)  $$T_{M+1}(x) = 2^{-M} \cos [(M + 1) \cos^{-1} x], \quad M = 0, 1, 2, \ldots.$$
In Subsection 4.2 we show that these are polynomials of the indicated
degrees. Now the points $\{x_j\}$ are not uniformly spaced, but are given by

(60)  $$x_j = \cos \left[ \frac{2j + 1}{M + 1}\, \frac{\pi}{2} \right], \quad j = 0, 1, \ldots, M.$$

Note that the corresponding points

$$\theta_j = \cos^{-1} x_j$$

are uniformly spaced in $[0, \pi]$.

We say that these sets $\{x_j\}$ are interesting, because on the one hand, the
discretely orthonormal polynomials are easily found in Theorem 10; and
on the other hand, we prove in Subsection 5.1 that various approximation
polynomials based on these points converge uniformly to any function
$f(x)$ with two continuous derivatives in $[-1, 1]$.

THEOREM 10. For the discrete set of points $\{x_j\}$ defined in (60), the
discretely orthonormal polynomials satisfying (51) are proportional to the
Chebyshev polynomials. Specifically they are:

(61)  $$\begin{aligned}
P_0(x) &\equiv (M + 1)^{-1/2}, \\
P_n(x) &\equiv 2^{1/2} (M + 1)^{-1/2} \cos (n \cos^{-1} x), \quad n = 1, 2, \ldots, M.
\end{aligned}$$

Proof. We must verify that $P_n(x)$ defined in (61) satisfies (51). This
follows directly from the discrete orthonormality property of the
trigonometric functions expressed in

LEMMA 1.

(62)  $$\sum_{j=0}^{M} \cos r\theta_j \cos s\theta_j =
\begin{cases}
0, & \text{for } 0 \le r \ne s \le M; \\[2pt]
\dfrac{M + 1}{2}, & \text{for } 0 < r = s \le M; \\[6pt]
M + 1, & \text{for } 0 = r = s;
\end{cases}$$

where

(63)  $$\theta_j = \frac{2j + 1}{M + 1}\, \frac{\pi}{2}, \quad j = 0, 1, \ldots, M.$$

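We can most readily evaluate the sum in (62) by making use of the
well-known formula,

(64)  $$e^{ix} = \cos x + i \sin x,$$

where $i^2 = -1$ and $x$ is a real number. Then we may write

(65)  $$\begin{aligned}
\sum_{j=0}^{M} \cos r\theta_j \cos s\theta_j
  &= \frac{1}{2} \sum_{j=0}^{M} \left[ \cos (r + s)\theta_j + \cos (r - s)\theta_j \right] \\
  &= \frac{1}{2} \operatorname{Re} \left( \sum_{j=0}^{M} e^{i(r+s)\theta_j} \right) + \frac{1}{2} \operatorname{Re} \left( \sum_{j=0}^{M} e^{i(r-s)\theta_j} \right).
\end{aligned}$$

But the right-hand sums may be treated as geometric series. That is, we
note that $\theta_j - \theta_0 = j\, 2\theta_0$, whence for $R = 1, 2, \ldots, 2M$,

$$\begin{aligned}
\sum_{j=0}^{M} e^{iR\theta_j} &= e^{iR\theta_0} \sum_{j=0}^{M} \left( e^{iR 2\theta_0} \right)^j \\
  &= e^{iR\theta_0} \left( \frac{1 - e^{iR 2\theta_0 (M+1)}}{1 - e^{iR 2\theta_0}} \right) \\
  &= e^{iR\theta_0} \left( \frac{e^{-iR\theta_0 (M+1)} - e^{iR\theta_0 (M+1)}}{e^{-iR\theta_0} - e^{iR\theta_0}} \right) \frac{e^{iR\theta_0 (M+1)}}{e^{iR\theta_0}} \\
  &= \frac{\sin R(M + 1)\theta_0}{\sin R\theta_0}\, e^{iR\theta_0 (M+1)}.
\end{aligned}$$

By taking the real part, we find

(66)  $$\begin{aligned}
\operatorname{Re} \left( \sum_{j=0}^{M} e^{iR\theta_j} \right)
  &= \frac{\sin R(M + 1)\theta_0\, \cos R(M + 1)\theta_0}{\sin R\theta_0} \\
  &= \frac{\sin 2R(M + 1)\theta_0}{2 \sin R\theta_0} \\
  &= \frac{\sin R\pi}{2 \sin \dfrac{R\pi}{2(M + 1)}} \\
  &= 0, \quad \text{for } R = 1, 2, \ldots, 2M.
\end{aligned}$$

If we now identify $R = r \pm s$, then from (65) with $r > s$ the first part
of (62) follows. The special case $r = s > 0$ of (62) results by using (65)
and observing that the sum in (66), for $R = 0$, is simply

$$\operatorname{Re} \sum_{j=0}^{M} e^{i 0 \theta_j} = M + 1.$$

Finally, the trivial case $r = s = 0$ of (62) is directly verifiable. Thus,
Lemma 1 is proven and from it follows Theorem 10. ■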
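
Lemma 1 and Theorem 10 are simple to confirm numerically. The sketch below assumes NumPy and tabulates the polynomials (61) at the nodes (60); the function names and the choice $M = 7$ are illustrative.

```python
import numpy as np


def chebyshev_nodes(M):
    """The points x_j of (60): zeros of the (M + 1)-st Chebyshev polynomial."""
    j = np.arange(M + 1)
    theta = (2 * j + 1) * np.pi / (2 * (M + 1))   # theta_j of (63)
    return np.cos(theta)


def discrete_chebyshev_table(M):
    """Row n holds P_n(x_j), j = 0..M, with P_n as in (61)."""
    x = chebyshev_nodes(M)
    table = np.empty((M + 1, M + 1))
    table[0] = 1.0 / np.sqrt(M + 1)
    for n in range(1, M + 1):
        table[n] = np.sqrt(2.0 / (M + 1)) * np.cos(n * np.arccos(x))
    return table


table = discrete_chebyshev_table(7)
gram = table @ table.T   # entry (r, s) is the sum in (51)
```

The matrix `gram` should be close to the identity, in agreement with (51).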

We note the fact that for any set of $M + 1$ distinct points $(x_0, x_1, \ldots, x_M)$,

THEOREM 11. The discrete least squares approximation polynomial $Q_M(x)$
of degree at most $M$ which minimizes

$$|f(x) - Q_M(x)|_D^2 = \sum_{i=0}^{M} [f(x_i) - Q_M(x_i)]^2,$$

is the interpolation polynomial for $f(x)$ based on the distinct points
$(x_0, \ldots, x_M)$.

Proof. Let $P_M(x)$ be the indicated interpolation polynomial. Then
$|f(x) - P_M(x)|_D = 0$ and since the interpolation polynomial is unique
we must have $P_M(x) \equiv Q_M(x)$. ■
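
Theorem 11 can be seen in a few lines, assuming NumPy: a least squares fit of degree $M$ to $M + 1$ distinct points leaves no residual and reproduces the interpolation polynomial. The points and the function $e^x$ are illustrative choices.

```python
import numpy as np
from numpy.polynomial import polynomial as poly

x = np.array([-1.0, -0.4, 0.1, 0.5, 1.0])   # M + 1 = 5 distinct points
fx = np.exp(x)

# Discrete least squares polynomial of degree M (monomial coefficients).
coeffs = poly.polyfit(x, fx, deg=len(x) - 1)
residual = fx - poly.polyval(x, coeffs)

# Interpolation polynomial from the Vandermonde system, for comparison.
interp = np.linalg.solve(np.vander(x, increasing=True), fx)
```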
We recall that with least squares approximations (discrete or not) the
next higher degree approximation is obtained by simply adding a new term
to the previous approximation. Theorem 11 then shows that for the
discrete case, as the degree increases for a fixed set of points, the
approximations “approach” the interpolation polynomial. However, for the $n + 1$
unequally spaced points (60) we show in Subsection 5.1 that while the
$n$th degree interpolation polynomials converge like $1/\sqrt{n}$ as $n \to \infty$ (for
a sufficiently smooth function), so do the discrete least square polynomials
of degrees $> \sqrt{n}$.

We observe that one reason for working with the discrete least square
method is that sums are readily computable. On the other hand, the
integrals of the continuous least square method, say of the form

$$\int_a^b f(x) P_n(x)\, dx,$$

are generally only determined approximately (frequently by using
quadrature formulae, i.e., sums).

The natural extension to weighted discrete least squares approximation 
is omitted. An important application of these approximation methods is 
to the art of fitting mathematical formulae to empirical data but we shall 
not treat that here. 


PROBLEMS, SECTION 3 


1.* A generalization of the Hilbert segments is furnished by the matrix
$A \equiv (a_{ij})$, $a_{ij} = 1/(\alpha_i + \beta_j)$, $i, j = 1, 2, \ldots, n$, where the $\alpha_i$ are distinct and the
$\beta_j$ are distinct. Show that the determinant of $A$ is

$$\det A = \det \left[ \frac{1}{\alpha_i + \beta_j} \right] =
\frac{\left[ \prod_{p=1}^{n-1} \prod_{q=p+1}^{n} (\alpha_p - \alpha_q) \right] \left[ \prod_{r=1}^{n-1} \prod_{s=r+1}^{n} (\beta_r - \beta_s) \right]}{\prod_{i=1}^{n} \left[ \prod_{j=1}^{n} (\alpha_i + \beta_j) \right]}
\quad \text{for } n \ge 2.$$

[Hint: Multiply the $i$th row of $A$ by $\prod_{j=1}^{n} (\alpha_i + \beta_j)$ for $i = 1, 2, \ldots, n$ and
call the resulting matrix $C$. The elements of $C$ are polynomials in $\{\alpha_i, \beta_j\}$,
hence, $\det C$ is a polynomial $P(\{\alpha_i\}, \{\beta_j\})$ of degree at most $n(n - 1)$. Observe
that $P$ is divisible by each of the factors in the numerator of the right-hand side
because a determinant vanishes if two columns or two rows are identical.
Hence, $P$ equals the numerator to within a constant factor, since the numerator
has degree $n(n - 1)$. Therefore, $\det A$ equals the right-hand side to within a
constant factor $K_n$. Determine $K_n$ by induction.]

2.* Notice that the cofactor of any element in the above matrix, $A$, is the
determinant of a matrix of similar form. Use the cofactor and the determinant
of $A$ to represent the elements of $A^{-1} \equiv (b_{jk})$. Express these elements in terms
of the Lagrange interpolation coefficients with respect to the points $\alpha_i$ and $\beta_j$;
the result should be

ftk = ( a k + ft)^lc( ft)ft( a fc)) 

where 


= 


B k (x) = 



Verify, by using equation (2.12), that $A A^{-1} = A^{-1} A = I$.

3. Show that any polynomial of degree m which is orthogonal to the first 
m + 1 orthogonal polynomials (i.e., to all orthogonal polynomials of degree m 
or less) is the identically vanishing polynomial. 

4. Verify that if $f(x)$ is continuous in $[a, b]$, $(x_0, x_1, \ldots, x_n)$ are distinct
points in $(a, b)$ and

$$w_M(x) =
\begin{cases}
(x - x_j + \epsilon_M) & \text{for } x_j - \epsilon_M \le x \le x_j, \\
-(x - x_j - \epsilon_M) & \text{for } x_j \le x \le x_j + \epsilon_M, \quad j = 0, 1, \ldots, n, \\
0 & \text{if } x \text{ is not in any of the above intervals,}
\end{cases}$$

where

$$\epsilon_M = \min_{i \ne j} \frac{|x_i - x_j|}{M},$$

then the associated weighted least squares polynomial approximations
$P_{n,M}(x)$ converge to the Lagrange interpolation polynomial as $M \to \infty$.

5. Given the linearly independent set of functions $\{g_i(x)\}$ for $i = 0, 1, 2, \ldots, n$,
verify that with the definition (22), the Gram-Schmidt orthogonalization
process (15)-(16) produces an orthonormal set $\{f_i(x)\}$.

6. If $w(x) \equiv 1$, $[a, b] = [-1, 1]$, show that for $g_k(x) \equiv x^k$, $k = 0, 1, 2, \ldots$,
the orthonormal polynomials $P_n(x)$ resulting from (14)-(16) are

$$P_n(x) = (n + \tfrac{1}{2})^{1/2}\, \frac{1}{2^n n!}\, \frac{d^n}{dx^n} (x^2 - 1)^n.$$

[Hint: Verify

$$\int_{-1}^{1} P_n(x) P_m(x)\, dx = \delta_{nm}$$

by integration by parts and use uniqueness of Gram-Schmidt process.]

[The polynomials $p_n(x) \equiv P_n(x) (n + \tfrac{1}{2})^{-1/2}$ are called the Legendre
polynomials, and have the properties

$$p_n(1) = 1; \qquad p_n(-1) = (-1)^n;$$

(recurrence relation)

$$p_{n+1}(x) = \frac{2n + 1}{n + 1}\, x\, p_n(x) - \frac{n}{n + 1}\, p_{n-1}(x), \quad n \ge 1.]$$
7. Verify that the Legendre polynomials also satisfy (differential equation)

$$(1 - x^2) p_n'' - 2x p_n' + n(n + 1) p_n = 0.$$

[Hint: Let $u \equiv (x^2 - 1)^n$ and apply Leibniz' rule to get $d^{n+1}/dx^{n+1}$ of
both sides of

$$(x^2 - 1) u'(x) = 2nx\, u.]$$

8. Prove that for the Legendre polynomials

$$|p_n(x)| \le 1 \quad \text{for } |x| \le 1.$$

[Hint: Consider

$$n(n + 1) f(x) \equiv n(n + 1) p_n^2(x) + (1 - x^2) [p_n'(x)]^2.$$

Note that $f(x) = p_n^2(x)$ if $p_n'(x) = 0$, or $x^2 = 1$. But, by using the differential
equation of Problem 7, $n(n + 1) f'(x) = 2x [p_n'(x)]^2 \ge 0$ if $x \ge 0$ and hence
the value of $|p_n(x)|$ at a local maximum point, $|x| < 1$, is $\le 1$.]

9. If $w(x) = (1 - x^2)^{-1/2}$, $[a, b] = [-1, 1]$, $g_k(x) \equiv x^k$ for $k = 0, 1, 2, \ldots$,
show that the sequence defined by Problem 5 is

$$P_n(x) = \sqrt{\frac{2}{\pi}}\, \cos (n \cos^{-1} x), \quad n = 1, 2, \ldots,$$

$$P_0(x) = \frac{1}{\sqrt{\pi}}.$$

[The polynomials

$$T_n(x) = 2^{1-n} \cos (n \cos^{-1} x), \quad n = 1, 2, \ldots,$$

$$T_0(x) = 2$$

are called Chebyshev polynomials of the first kind. They satisfy (recurrence
relation):

$$T_{n+1}(x) = x T_n(x) - \tfrac{1}{4} T_{n-1}(x);$$

(differential equation):

$$(1 - x^2) T_n'' = x T_n' - n^2 T_n.]$$

10. Show that $\| \cdot \|_2$ is a strict norm [see (2)].

[Hint: (a) The triangle inequality follows from the Schwarz inequality:

$$\left| \int_a^b f(x) g(x)\, dx \right| \le FG$$

where

$$F \equiv \left\{ \int_a^b [f(x)]^2\, dx \right\}^{1/2}; \qquad G \equiv \left\{ \int_a^b [g(x)]^2\, dx \right\}^{1/2}.$$

Observe

$$0 \le \int_a^b [\alpha f(x) + \beta g(x)]^2\, dx = (\alpha F + \beta G)^2 + 2\alpha\beta \left[ \int_a^b f(x) g(x)\, dx - FG \right].$$

For $F \ne 0 \ne G$, $\alpha F + \beta G = 0$ implies $\alpha\beta < 0$ and the inequality follows.

(b) Now to show strictness, note $\|f + g\| = \|f\| + \|g\|$ implies
$\int_a^b f(x) g(x)\, dx = FG$, whence with non-trivial $\alpha$, $\beta$,

$$0 = (\alpha F + \beta G)^2 = \int_a^b [\alpha f(x) + \beta g(x)]^2\, dx.]$$

11. Given $(c_0, c_1, \ldots, c_n)$ find a polynomial $W_n(x)$ of degree at most $n$,
such that

$\int_a^b x^k W_n(x)\, dx = c_k$ for $k = 0, 1, \ldots, n$.

[Hint: Use the orthonormal polynomials $P_j(x)$, $j = 0, 1, \ldots, n$, defined by
(15), (16) and (17).]

12. * Prove Theorem 8. 

13. * Prove Corollary to Theorem 8. 

14. Given the discrete orthonormal polynomials $\{P_n(x)\}$, for $0 \le n \le M$,
on the set $\{x_j\}$ for $0 \le j \le M$ [i.e., $\sum_{j=0}^{M} P_r(x_j) P_s(x_j) = \delta_{rs}$], verify, if $n \le M$,
that

$Q_n(x) = \sum_{k=0}^{n} d_k P_k(x)$ with $d_k = \sum_{j=0}^{M} f(x_j) P_k(x_j)$

is the unique polynomial of degree at most $n$ which minimizes

$\sum_{j=0}^{M} |f(x_j) - Q_n(x_j)|^2.$

[Hint: Show that

$\sum_{j=0}^{M} |f(x_j) - W_n(x_j)|^2 = \sum_{j=0}^{M} |f(x_j) - Q_n(x_j)|^2 + \sum_{k=0}^{n} e_k^2,$

where

$W_n(x) = Q_n(x) + \sum_{k=0}^{n} e_k P_k(x).$]


15. Show that $|\cdot|_D$ is a semi-norm.

[Hint: $|f + g|_D \le |f|_D + |g|_D$ is a consequence of the Cauchy-Schwarz
inequality (see Chapter 1, Section 4)

$\left| \sum_{i=0}^{M} f_i g_i \right| \le \left( \sum_{i=0}^{M} f_i^2 \right)^{1/2} \left( \sum_{i=0}^{M} g_i^2 \right)^{1/2}.$]

16. $|\cdot|_D$ is essentially strict if $|f + g|_D = |f|_D + |g|_D$ implies there exist
non-trivial $\alpha$, $\beta$ such that $\alpha f_i + \beta g_i = 0$ for $i = 0, 1, \ldots, M$. Prove $|\cdot|_D$ is
essentially strict.

17. If $n \le M$, show that the polynomial $Q_n(x)$ which minimizes $|f(x) - Q_n(x)|_D$
is unique.

18. Verify that the orthonormal polynomials $\{P_n(x)\}$ for weight $w(x)$ and
interval $[a, b]$ may be represented in the form

$P_n(x) = c_n \dfrac{1}{w(x)} \dfrac{d^n}{dx^n} [v_n(x)]$

with $c_n$ a normalization constant, in the common classical cases:

(a) $[a, b] = [-1, 1]$, $w(x) = (1 - x)^{\alpha}(1 + x)^{\beta}$ with $\alpha > -1$, $\beta > -1$ [i.e.,
$v_n(x) = (1 - x)^{\alpha + n}(1 + x)^{\beta + n}$] (Jacobi polynomials)

(b) $[a, b] = [0, \infty]$, $w(x) = e^{-\alpha x}$ with $\alpha > 0$ [i.e., $v_n(x) = x^n e^{-\alpha x}$] (Laguerre
polynomials)

(c) $[a, b] = [-\infty, \infty]$, $w(x) = e^{-\alpha^2 x^2}$ [i.e., $v_n(x) = e^{-\alpha^2 x^2}$] (Hermite
polynomials)

[Hint:

$P_n(x) \equiv \dfrac{1}{w(x)} \dfrac{d^n}{dx^n} [v_n(x)]$

is a polynomial of at most degree $n$, if $v_n(x)$ satisfies

$\dfrac{d^{n+1}}{dx^{n+1}} \left\{ \dfrac{1}{w(x)} \dfrac{d^n}{dx^n} [v_n(x)] \right\} = 0;$

furthermore, $\int_a^b P_n(x) P_m(x) w(x)\, dx = 0$ for $n > m$ is implied with the use of
integration by parts from

$\int_a^b \dfrac{d^n}{dx^n}[v_n(x)]\, P_m(x)\, dx = -\int_a^b \dfrac{d^{n-1}}{dx^{n-1}}[v_n(x)]\, P_m'(x)\, dx$

$\qquad = (-1)^{m+1} \int_a^b \dfrac{d^{n-m-1}}{dx^{n-m-1}}[v_n(x)]\, P_m^{(m+1)}(x)\, dx$

if

$\dfrac{d^r}{dx^r}[v_n(x)] = 0$ at $x = a$ and $x = b$

for $r = 0, 1, \ldots, n - 1$.]


19. By the use of Problem 18 find another representation for the Chebyshev 
polynomials. 

20. * Prove Theorem 9 for r > 2. 

21. Verify that with the definitions (54) and (55),

$\Delta u^{[n]}(x) = nh\, u^{[n-1]}(x), \quad n \ge 1.$


4. POLYNOMIALS OF “BEST” APPROXIMATION 

Another measure of the deviation between a function, $f(x)$, and an
approximating polynomial of degree $n$,

(1)  $P_n(x) = a_0 + a_1 x + \cdots + a_n x^n,$

is the so-called maximum norm:

(2)  $\|f(x) - P_n(x)\|_\infty \equiv \max_{a \le x \le b} |f(x) - P_n(x)| \equiv D(f, P_n).$

A polynomial which minimizes this norm is conventionally called a
polynomial of "best" approximation.
Equation (2) defines a function of the $n + 1$ coefficients $\{a_j\}$ that is not
as explicit as (3.4),

(3)  $d(a_0, a_1, \ldots, a_n) \equiv \max_{a \le x \le b} |f(x) - P_n(x)|.$

A polynomial of best approximation is characterized by a point $\mathbf{a}$ in
$(n + 1)$-space at which $d(\mathbf{a})$ is a minimum. The existence of such a
polynomial is shown by

THEOREM 1. Let $f(x)$ be a given function continuous in $[a, b]$. Then for
any integer $n$ there exists a polynomial $P_n(x)$, of degree at most $n$, that
minimizes $\|f(x) - P_n(x)\|_\infty$.

Proof. We shall verify the hypotheses of Theorem 0.1 to obtain
existence of a minimizing polynomial $P_n(x) = \sum_{j=0}^{n} a_j x^j$.
Clearly, $\|\cdot\|_\infty$ is a norm in the space of continuous functions on $[a, b]$.
We only need to establish (0.5) for this norm. That is, we must show that
on the subset of polynomials such that

$\sum_{j=0}^{n} a_j^2 = 1, \qquad \min \|P_n(x)\|_\infty \equiv m_n > 0.$

By the argument in the proof of Theorem 0.1, $\|P_n(x)\|_\infty$ is a continuous
function of the variables $\{a_j\}$. Since the set $\sum_{j=0}^{n} a_j^2 = 1$ is closed and
bounded, we may apply the Weierstrass theorem, which assures us that there is a
point $\{\hat{a}_j\}$ at which

$m_n = \min_{\sum a_j^2 = 1} \|P_n(x)\|_\infty$

is attained. But at this point

$m_n = \left\| \sum_{j=0}^{n} \hat{a}_j x^j \right\|_\infty \ne 0,$

since

$\sum_{j=0}^{n} \hat{a}_j^2 = 1$

and any non-trivial polynomial of degree at most $n$ can have at most $n$
zeros (i.e., if $\|P_n(x)\|_\infty = 0$ then $P_n(x) \equiv 0$). ■

At this point we observe that $\|\cdot\|_\infty$ is not a strict norm, and hence we
cannot use Theorem 0.2 to establish uniqueness of $P_n(x)$. Nevertheless, the
"best" approximation polynomial is unique and we will prove this fact
in Theorem 3. It is of interest to note that Theorem 1 remains true if the
norm $\|\cdot\|_\infty$ of (2) is replaced by the semi-norm $|\cdot|_{\infty,S}$ defined as

$|f|_{\infty,S} \equiv \sup_{x \in S} |f(x)|$

where $S$ is any subset of $[a, b]$ containing at least $n + 1$ points.

From the Weierstrass approximation theorem, it follows that

$\lim_{n \to \infty} D(f, P_n) = 0.$

Furthermore, if $f(x)$ has $r$ continuous derivatives in $[a, b]$, then by the
convergence result for expansions in Chebyshev polynomials (see Theorem
9 in Subsection 3.4)

$D(f, P_n) = O(n^{1-r})$ for $r \ge 2$.

4.1. The Error in the Best Approximation 

It is a relatively easy matter to obtain bounds on the deviation of the
best approximation polynomial of degree $n$. Let us call this quantity

(4)  $d_n(f) \equiv \min_{\{a_0, \ldots, a_n\}} d(a_0, \ldots, a_n) = \min_{\{P_n(x)\}} D(f, P_n).$

Then for any polynomial $P_n(x)$ we have the upper bound

$d_n(f) \le D(f, P_n).$

Lower bounds can be obtained by means of

THEOREM 2 (DE LA VALLÉE-POUSSIN). Let an nth degree polynomial $P_n(x)$
have the deviations from $f(x)$

(5)  $f(x_j) - P_n(x_j) = (-1)^j e_j, \quad j = 0, 1, \ldots, n + 1,$

where $a \le x_0 < x_1 < \cdots < x_{n+1} \le b$ and all $e_j > 0$ or else all $e_j < 0$.
Then

(6)  $\min_j |e_j| \le d_n(f).$

Proof. Assume that for some polynomial $Q_n(x)$,

(7)  $D(f, Q_n) < \min_j |e_j|.$

Then the nth degree polynomial

$Q_n(x) - P_n(x) = [f(x) - P_n(x)] - [f(x) - Q_n(x)]$

has the same sign at the points $x_j$ as does $f(x) - P_n(x)$. Thus, there are
$n + 1$ sign changes and consequently, at least $n + 1$ zeros of this difference.
But then, this nth degree polynomial identically vanishes and so $P_n(x) \equiv
Q_n(x)$, which from (7) and (2) is impossible. This contradiction arose from
assuming (7); hence $D(f, Q_n) \ge \min_j |e_j|$ for every polynomial $Q_n(x)$,
and (6) is established. ■

To employ this theorem we need only construct a polynomial which 
oscillates about the function being approximated at least n + 1 times. 
This can usually be done by means of an interpolation polynomial of 
degree n. 
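
The two-sided bound can be illustrated numerically. The following is a minimal sketch, assuming plain Python and an example not taken from the text: $f(x) = e^x$ on $[-1, 1]$, with $P_1(x)$ the linear interpolant at $\pm\sqrt{2}/2$. The names `vallee_poussin_bounds` and `p1` are hypothetical, and the upper bound is only sampled on a grid, so in general it approximates $D(f, P_n)$ from below.

```python
import math

def vallee_poussin_bounds(f, p, points, a, b, samples=2001):
    """Return (lower, upper) with lower <= d_n(f) <= upper.

    lower is min |e_j| from Theorem 2; upper is a grid estimate of D(f, p).
    """
    devs = [f(x) - p(x) for x in points]
    # Theorem 2 requires deviations of strictly alternating sign.
    if any(d0 * d1 >= 0 for d0, d1 in zip(devs, devs[1:])):
        raise ValueError("deviations do not alternate in sign")
    lower = min(abs(d) for d in devs)
    grid = [a + (b - a) * i / (samples - 1) for i in range(samples)]
    upper = max(abs(f(x) - p(x)) for x in grid)
    return lower, upper

# Linear interpolant of exp at x = -z and x = +z (illustrative choice of nodes).
z = math.sqrt(0.5)
slope = (math.exp(z) - math.exp(-z)) / (2 * z)
intercept = (math.exp(z) + math.exp(-z)) / 2
p1 = lambda x: intercept + slope * x

# n = 1, so n + 2 = 3 points with alternating deviations are needed.
lower, upper = vallee_poussin_bounds(math.exp, p1, [-1.0, 0.0, 1.0], -1.0, 1.0)
```

For this example the bounds come out near 0.19 and 0.37, which brackets the deviation of the best linear approximation to $e^x$ on $[-1, 1]$.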

A necessary and sufficient condition which characterizes a best
approximation polynomial and establishes its uniqueness is contained in

THEOREM 3 (CHEBYSHEV). A polynomial of degree at most $n$, $P_n(x)$, is a
best approximation of degree at most $n$ to $f(x)$ in $[a, b]$ if and only if
$f(x) - P_n(x)$ assumes the values $\pm D(f, P_n)$, with alternate changes of sign, at least
$n + 2$ times in $[a, b]$. This best approximation polynomial is unique.

Proof. Suppose $P_n(x)$ has the indicated oscillation property. Then let
$x_j$ with $j = 0, 1, \ldots, n + 1$ be $n + 2$ points at which this maximum
deviation is attained with alternate sign changes. Using these points in
Theorem 2 we see that $|e_j| = D(f, P_n)$ and hence

$d_n(f) \ge D(f, P_n).$

From equation (4), the definition of $d_n(f)$, it follows that $D(f, P_n) = d_n(f)$
and the $P_n(x)$ in question is a best approximation polynomial. This shows
the sufficiency of the uniform oscillation property.

To demonstrate the necessity, we will show that if $f(x) - P_n(x)$ attains
the values $\pm D(f, P_n)$ with alternate sign changes at most $k$ times, where
$2 \le k \le n + 1$, then $D(f, P_n) > d_n(f)$. Let us assume, with no loss in
generality, that

$f(x_j) - P_n(x_j) = (-1)^j D(f, P_n), \quad j = 1, 2, \ldots, k,$

where $a \le x_1 < x_2 < \cdots < x_k \le b$. Then, there exist points $\xi_1, \xi_2, \ldots,
\xi_{k-1}$ separating the $x_j$, i.e.,

$x_1 < \xi_1 < x_2 < \xi_2 < \cdots < \xi_{k-1} < x_k$

and an $\epsilon > 0$ such that $|f(\xi_j) - P_n(\xi_j)| < D(f, P_n)$ and

$-D(f, P_n) \le f(x) - P_n(x) \le D(f, P_n) - \epsilon,$

for $x$ in the "odd" intervals, $[a, \xi_1], [\xi_2, \xi_3], [\xi_4, \xi_5], \ldots$; while

$-D(f, P_n) + \epsilon \le f(x) - P_n(x) \le D(f, P_n),$

for $x$ in the "even" intervals, $[\xi_1, \xi_2], [\xi_3, \xi_4], \ldots$. For example, we may
define $\xi_1 = \tfrac{1}{2}(\eta_1 + \zeta_1)$ where $\eta_1 = \text{g.l.b.}\,\{\eta\}$ for $a \le \eta \le x_2$ and
$f(\eta) - P_n(\eta) = D(f, P_n)$; and similarly $\zeta_1 = \text{l.u.b.}\,\{\zeta\}$ for $a \le \zeta \le x_2$ and
$f(\zeta) - P_n(\zeta) = -D(f, P_n)$. Then $x_1 \le \zeta_1 < \eta_1 \le x_2$; otherwise, we may
insert $\eta_1$ and $\zeta_1$ in place of $x_1$ in the original sequence and find $k + 1$
alternations of sign. That is, alternately for each of the $k$ intervals
$[a, \xi_1], \ldots, [\xi_{k-1}, b]$, the deviation $f(x) - P_n(x)$ takes on only one of the
extreme deviations $\pm D(f, P_n)$ and is bounded away from the extreme of
opposite sign. The polynomial

$r(x) \equiv (x - \xi_1)(x - \xi_2) \cdots (x - \xi_{k-1})$

has degree $k - 1$ and is of one sign throughout each of the $k$ intervals in
question. Let the maximum value of $|r(x)|$ in $[a, b]$ be $M$. Now define
$q(x) \equiv (-1)^k r(x)/2M$ and consider the nth degree polynomial (since
$k - 1 \le n$)

$Q_n(x) = P_n(x) + \epsilon q(x),$

for sufficiently small positive $\epsilon$. We claim that $D(f, Q_n) < D(f, P_n)$,
and so $P_n(x)$ could not be a best approximation. Indeed, in the interior
of any of the "odd" intervals $(a, \xi_1), (\xi_2, \xi_3), \ldots$, we have that $-\tfrac{1}{2} \le
q(x) < 0$ and conversely in the "even" intervals $(\xi_1, \xi_2), (\xi_3, \xi_4), \ldots$, we
have that $0 < q(x) \le \tfrac{1}{2}$. However, recalling the above inequalities,

$-D(f, P_n) - \epsilon q(x) \le f(x) - Q_n(x) \le D(f, P_n) - \epsilon[1 + q(x)],$
    $x$ in odd intervals;

$-D(f, P_n) + \epsilon[1 - q(x)] \le f(x) - Q_n(x) \le D(f, P_n) - \epsilon q(x),$
    $x$ in even intervals.

From the signs and magnitude of $q(x)$ in each interval, it easily follows that
$D(f, Q_n) < D(f, P_n)$ and the proof of necessity is completed.

To demonstrate uniqueness we assume that there are two best
approximations, say $P_n(x)$ and $Q_n(x)$, both of degree at most $n$. Since by
assumption $D(f, P_n) = D(f, Q_n) = d_n(f)$, we have in $[a, b]$,

$|f(x) - \tfrac{1}{2}[P_n(x) + Q_n(x)]| \le \tfrac{1}{2}|f(x) - P_n(x)| + \tfrac{1}{2}|f(x) - Q_n(x)| \le d_n(f).$

Thus, $\tfrac{1}{2}[P_n(x) + Q_n(x)]$ is another best approximation and we must have,
by the first part of the theorem,

$|f(x) - \tfrac{1}{2}[P_n(x) + Q_n(x)]| = d_n(f)$

at $n + 2$ distinct points in $[a, b]$. From the inequality, it follows that at
these points $f(x) - P_n(x) = f(x) - Q_n(x) = \pm d_n(f)$. Thus, the difference
$[f(x) - P_n(x)] - [f(x) - Q_n(x)] = Q_n(x) - P_n(x)$ vanishes at $n + 2$
distinct points. Since this difference is an nth degree polynomial, it vanishes
identically, i.e., $Q_n(x) \equiv P_n(x)$, and the proof is complete. ■

This theorem can be used to recognize the best approximation
polynomial. It is also the basis, along with Theorem 2, of various methods for
approximating the best approximation polynomial. There is no finite
procedure for constructing the best approximation polynomial for
arbitrary continuous functions. However, the best approximation is known in
some important special cases; see, for example, the next subsection and
Problem 2.

As an obvious consequence of Theorem 3, it follows that the best
approximation, $P_n(x)$, of degree at most $n$ is equal to $f(x)$, the function it
approximates, at $n + 1$ distinct points, say $x_0, x_1, \ldots, x_n$. Thus, $P_n(x)$ is the
interpolation polynomial for $f(x)$ with respect to the points $\{x_j\}$ (since by
Lemma 2.1 the interpolation polynomial of degree at most $n$ is unique).
Of course, for an arbitrary continuous function, $f(x)$, a corresponding
set of interpolation points $\{x_j\}$ is not known a priori. Thus, this
observation cannot, in general, be used to determine $P_n(x)$. However, if $f(x)$ has
$n + 1$ continuous derivatives, Theorem 2.1 applies since $P_n(x)$ is an
interpolation polynomial, and we have determined a form for the error in the
best approximation of degree at most $n$. In summary, these observations
can be stated as a

COROLLARY. Let $f(x)$ have a continuous $(n + 1)$st derivative in $[a, b]$ and
let $P_n(x)$ be the best polynomial approximation to $f(x)$ of degree at most $n$
in this interval. Then, there exist $n + 1$ distinct points $x_0, x_1, \ldots, x_n$ in
$a < x < b$ such that

(8)  $R_n(x) \equiv f(x) - P_n(x) = \dfrac{(x - x_0)(x - x_1) \cdots (x - x_n)}{(n + 1)!}\, f^{(n+1)}(\xi),$

where $\xi = \xi(x)$ is in the interval:

$\min(x, x_0, \ldots, x_n) < \xi < \max(x, x_0, \ldots, x_n).$ ■

4.2. Chebyshev Polynomials 

In the expression (8) for $R_n(x)$, the error of the best approximation, it
will, in general, not be known at what point, $\xi = \xi(x)$, the derivative is to
be evaluated. Hence, the value of the derivative is not known. An exception
to this is the case when $f^{(n+1)}(x) \equiv$ constant, which occurs if and only if
$f(x)$ is a polynomial of degree at most $n + 1$. In this special case, the
error (8) can be minimized by choosing the points $x_0, x_1, \ldots, x_n$ such that
the polynomial

(9a)  $(x - x_0)(x - x_1) \cdots (x - x_n)$

has the smallest possible maximum absolute value in the interval in question
(say, $a \le x \le b$). In the general case, the choice of these same
interpolation points may be expected to yield a reasonable approximation to the
best polynomial, i.e., the smaller the variation of $f^{(n+1)}(x)$ in $[a, b]$, the
better the approximation.

We are thus led to consider the following problem: Among all
polynomials of degree $n + 1$, with leading coefficient unity, find the polynomial
which deviates least from zero in the interval $[a, b]$. In other words, we are
seeking the best approximation to the function $g(x) \equiv 0$ among
polynomials of the form

(9b)  $x^{n+1} - P_n(x),$

where $P_n(x)$ is a polynomial of degree at most $n$. Alternatively, the problem
can then be formulated as: find the best polynomial approximation of
degree at most $n$ to the function $x^{n+1}$.

For this latter problem, Theorem 1 is applicable and we conclude that
such a polynomial exists and it is uniquely characterized in Theorem 3.
Thus, we need only construct a polynomial of the form (9) whose maximum
absolute value is attained at $n + 2$ points with alternate sign changes.

Consider, for the present, the interval $[a, b] = [-1, 1]$. We introduce
the change of variable

(10)  $x = \cos\theta,$

which takes on each value in $[-1, 1]$ once and only once when $\theta$ is
restricted to the interval $[0, \pi]$. Furthermore, the function $\cos (n + 1)\theta$
attains its maximum absolute value, unity, at $n + 2$ successive points with
alternate signs for

$\theta = \dfrac{k\pi}{n + 1}, \quad k = 0, 1, \ldots, n + 1.$

Therefore, the function

(11)  $T_{n+1}(x) = A_{n+1} \cos (n + 1)\theta = A_{n+1} \cos [(n + 1)\cos^{-1} x],$

has the required properties as regards its extrema. To show that $T_{n+1}(x)$
is also a polynomial in $x$ of degree $n + 1$ we consider the standard
trigonometric addition formula

(12)  $\cos (n + 1)\theta + \cos (n - 1)\theta = 2 \cos\theta \cos n\theta, \quad n = 0, 1, \ldots.$

Let us define

(13a)  $t_n(x) \equiv \cos (n \cos^{-1} x), \quad n = 0, 1, 2, \ldots,$

in terms of which (12) becomes

(13b)  $t_{n+1}(x) = 2 t_1(x) t_n(x) - t_{n-1}(x), \quad n = 1, 2, 3, \ldots.$

Clearly, from (13a), $t_0(x) \equiv 1$, $t_1(x) \equiv x$ and so, by induction, it follows
from (13b) that $t_{n+1}(x)$ is a polynomial in $x$ of degree $n + 1$. It also follows
by induction that

(13c)  $t_{n+1}(x) = 2^n x^{n+1} + q_n(x), \quad n = 0, 1, 2, \ldots,$

where $q_n(x)$ is a polynomial of degree at most $n$. Thus, with the choice
$A_{n+1} = 2^{-n}$ in (11), these results imply

(14)  $T_{n+1}(x) \equiv 2^{-n} \cos [(n + 1) \cos^{-1} x] = 2^{-n} t_{n+1}(x) = x^{n+1} + 2^{-n} q_n(x).$

At the $n + 2$ points

(15a)  $\xi_k = \cos \dfrac{k\pi}{n + 1}, \quad k = 0, 1, \ldots, n + 1,$

which are in $[-1, 1]$, we have from (14)

(15b)  $T_{n+1}(\xi_k) = 2^{-n} \cos k\pi = 2^{-n}(-1)^k.$

Thus we have proven that $T_{n+1}(x)$ is the polynomial of form (9) which
deviates least from zero in $[-1, 1]$; the maximum deviation is $2^{-n}$.
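
The recurrence (13b) and the equioscillation property (15b) can be checked numerically. The following is a minimal sketch in plain Python; the function names `cheb_t` and `monic_cheb` are illustrative and do not come from the text.

```python
import math

def cheb_t(n, x):
    """Evaluate t_n(x) = cos(n arccos x) by the recurrence (13b)."""
    if n == 0:
        return 1.0
    prev, cur = 1.0, x  # t_0(x) and t_1(x)
    for _ in range(n - 1):
        prev, cur = cur, 2.0 * x * cur - prev
    return cur

def monic_cheb(degree, x):
    """Evaluate T_degree(x) = 2^(1 - degree) t_degree(x) for degree >= 1, as in (14)."""
    return 2.0 ** (1 - degree) * cheb_t(degree, x)

n = 4
# The n + 2 extremal points (15a) and the values of T_{n+1} there.
extrema = [math.cos(k * math.pi / (n + 1)) for k in range(n + 2)]
values = [monic_cheb(n + 1, xi) for xi in extrema]
```

Under these assumptions `values` alternates between $+2^{-n}$ and $-2^{-n}$, in agreement with (15b).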

The polynomials in (14) are called the Chebyshev polynomials (of the
first kind; see Problem 9 of Section 3). If the zeros of the $(n + 1)$st such
polynomial are used to construct an interpolation polynomial of degree
at most $n$, then for $x$ in $[-1, 1]$ the coefficient of $f^{(n+1)}(\xi)$ in the error (8)
of this approximation will have the least possible absolute maximum.

If the interval of approximation for the continuous function $g(y)$ is
$a \le y \le b$, then the transformation

(16)  $x = \dfrac{2y - a - b}{b - a}$  or  $y = \tfrac{1}{2}[(b - a)x + (a + b)]$

converts the problem of approximating $g(y)$ into that of approximating
$f(x) \equiv g[y(x)]$ in the $x$-interval $[-1, 1]$. The zeros of $T_{n+1}(x)$ are at

(17a)  $x_k = \cos \left( \dfrac{2k + 1}{n + 1} \cdot \dfrac{\pi}{2} \right), \quad k = 0, 1, \ldots, n;$

and the corresponding interpolation points in $[a, b]$ are then at

(17b)  $y_k = \tfrac{1}{2}[(b - a)x_k + (a + b)], \quad k = 0, 1, \ldots, n.$

The value of the maximum deviation of $\prod_{j=0}^{n} (y - y_j)$ from zero in $[a, b]$
is then, using (16) and (17b):

(18)  $\max_{a \le y \le b} \prod_{j=0}^{n} |y - y_j| = \left| \dfrac{b - a}{2} \right|^{n+1} \max_{-1 \le x \le 1} \left| \prod_{k=0}^{n} (x - x_k) \right| = \dfrac{1}{2^n} \left| \dfrac{b - a}{2} \right|^{n+1}.$

We stress that when the points in (17b) are employed to determine an
interpolation polynomial for $g(y)$ over $[a, b]$, this polynomial will not, in
general, be the Chebyshev best approximation polynomial of degree $n$.
However, these points may be used to get an estimate for the error of the
best approximation polynomial. Iterative methods, based on Theorems
2 and 3, have been devised to compute with arbitrary precision the
polynomial of best approximation.
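
The interpolation points (17b) and the value (18) can be checked numerically. The following is a minimal sketch in plain Python. The helper names and the sample interval $[0, 3]$ are illustrative assumptions, and the maximum is only sampled on a grid.

```python
import math

def chebyshev_nodes(n, a, b):
    """Interpolation points (17b): zeros of T_{n+1} mapped from [-1, 1] to [a, b]."""
    xs = [math.cos((2 * k + 1) * math.pi / (2 * (n + 1))) for k in range(n + 1)]
    return [0.5 * ((b - a) * x + (a + b)) for x in xs]

def max_node_product(nodes, a, b, samples=20001):
    """Grid estimate of max over [a, b] of |prod_j (y - y_j)|."""
    best = 0.0
    for i in range(samples):
        y = a + (b - a) * i / (samples - 1)
        best = max(best, abs(math.prod(y - yk for yk in nodes)))
    return best

n, a, b = 5, 0.0, 3.0
cheb_max = max_node_product(chebyshev_nodes(n, a, b), a, b)
predicted = 2.0 ** (-n) * ((b - a) / 2.0) ** (n + 1)  # right-hand side of (18)

# Equally spaced points on the same interval, for comparison only.
equi_max = max_node_product([a + (b - a) * k / n for k in range(n + 1)], a, b)
```

Here `cheb_max` should agree with `predicted` up to rounding, because the maximum is attained at the endpoints, which lie on the grid. The equally spaced points give a noticeably larger maximum.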


PROBLEMS, SECTION 4 

1. Prove that the nth Chebyshev polynomial can be expressed as:

$T_n(x) = 2^{-n}\left[(x + \sqrt{x^2 - 1})^n + (x - \sqrt{x^2 - 1})^n\right].$

2. Find the best approximations of degrees 0 and 1 to $f(x) \in C^2[a, b]$
provided $f''(x) \ne 0$ in $[a, b]$ [i.e., calculate the coefficients of these best
approximation polynomials in terms of properties of $f(x)$].


5. TRIGONOMETRIC APPROXIMATION 

We say that $S_n(x)$ is a trigonometric sum of order at most $n$, if

(1a)  $S_n(x) = \tfrac{1}{2} a_0 + \sum_{k=1}^{n} (a_k \cos kx + b_k \sin kx).$

The coefficients $a_k$ and $b_k$ are real numbers. By using the exponential
function

(1b)  $e^{i\theta} = \cos\theta + i \sin\theta, \quad \cos\theta = \tfrac{1}{2}(e^{i\theta} + e^{-i\theta}),$
      and $\sin\theta = \dfrac{1}{2i}(e^{i\theta} - e^{-i\theta}),$

where now $i^2 = -1$, it follows that (1a) can be written in a simpler form
with complex coefficients:

(1c)  $S_n(x) = \sum_{k=-n}^{n} c_k e^{ikx}.$

Here

$c_0 = \tfrac{1}{2} a_0, \quad c_k = \tfrac{1}{2}(a_k - i b_k), \quad c_{-k} = \tfrac{1}{2}(a_k + i b_k),$

for $k = 1, 2, \ldots, n$.

A basic result on approximation by such trigonometric sums is again due
to Weierstrass and can be stated as:

THEOREM 1 (WEIERSTRASS). Let $f(\theta)$ be continuous on $[-\pi, \pi]$ and periodic
with period $2\pi$. Then for any $\epsilon > 0$ there exists an $n = n(\epsilon)$ and a
trigonometric sum, $S_n(\theta)$, such that

$|f(\theta) - S_n(\theta)| < \epsilon$

for all $\theta$.

Proof. A proof of this theorem can be given by employing the
Weierstrass polynomial approximation theorem of Section 1.

We sketch a simpler proof based on the Weierstrass polynomial
approximation theorem for a continuous function $g(x, y)$ in the square, $-1 \le x$,
$y \le 1$ (see Problem 2, Section 1). Define $g(x, y) \equiv \rho f(\theta)$ for $x = \rho \cos\theta$,
$y = \rho \sin\theta$, $0 \le \rho \le 2$, $-\pi < \theta \le \pi$.

Clearly, $g(x, y)$ is defined and continuous in the square. Hence, given
$\epsilon > 0$, there exists a polynomial $P_n(x, y)$ such that $|g(x, y) - P_n(x, y)| < \epsilon$.
But, then for $\rho = 1$, we have $g(x, y) \equiv f(\theta)$, $x^2 + y^2 = 1$, and therefore,
$|f(\theta) - P_n(\cos\theta, \sin\theta)| < \epsilon$. We leave as Problem 3 the verification that
$P_n(\cos\theta, \sin\theta)$ may be written as a trigonometric sum, $S_n(\theta)$. ■

We proceed to show that the previous methods of polynomial approxi¬ 
mation have corresponding trigonometric counterparts. 

5.1. Trigonometric Interpolation 

If the points of interpolation are equally spaced, it is relatively easy to
determine a trigonometric sum which takes on specified values at the
appropriate points. Let $f(x)$ be continuous and have period $2\pi$. For this
section only, we introduce the convention

$\sum_{j=-n}^{n}{}' \, a_j \equiv \sum_{j=-n}^{n} a_j - \tfrac{1}{2}(a_n + a_{-n}).$

With this notation, we define the trigonometric sum

(2)  $U_n(x) = \sum_{j=-n}^{n}{}' \, c_j e^{ijx}.$

On the interval $[-\pi, \pi]$ we place the $2n + 1$ equally spaced points

(3)  $x_k = kh, \quad k = 0, \pm 1, \pm 2, \ldots, \pm n, \quad h = \dfrac{\pi}{n}.$

The interpolation problem is to find coefficients $c_j$ such that

(4a)  $U_n(x_k) = f(x_k), \quad k = 0, \pm 1, \ldots, \pm n.$

Later we consider interpolation on a different set of uniformly spaced
points. Since $f(x)$ and $U_n(x)$ have period $2\pi$, $f(x_n) = f(x_{-n})$ and $U_n(x_n) =
U_n(x_{-n})$, so that there are only $2n$ independent conditions in (4a) to
determine the $2n + 1$ coefficients, $c_k$. We require that

(4b)  $c_n = c_{-n},$

as the extra condition, and it will be shown to be consistent with the
conditions (4a).

A simple calculation, based on summing a geometric series (see
Subsection 3.5), reveals that

(5)  $\sum_{k=-n}^{n}{}' \, e^{ijx_k} e^{-imx_k} = \begin{cases} 0 & \text{if } j \not\equiv m \pmod{2n},\dagger \\ 2n & \text{if } j \equiv m \pmod{2n}. \end{cases}$

In direct analogy with orthogonal functions over an interval, we see that
the quantities $\{e^{ikx_j}\}$ are orthogonal with respect to the summation $\sum_{j=-n}^{n}{}'$.

Hence, we set $x = x_k$ in (2), multiply both sides of (2) by $e^{-imx_k}$ and sum
with respect to $k$ to find upon the use of (5):

$\sum_{k=-n}^{n}{}' \, e^{-imx_k} U_n(x_k) = \sum_{k=-n}^{n}{}' \, e^{-imx_k} \sum_{j=-n}^{n}{}' \, c_j e^{ijx_k}$

$\qquad = \sum_{j=-n}^{n}{}' \, c_j \sum_{k=-n}^{n}{}' \, e^{ijx_k} e^{-imx_k}$

$\qquad = \begin{cases} 2n\, c_m & \text{if } |m| < n, \\ 2n\, \dfrac{c_n + c_{-n}}{2} & \text{if } |m| = n. \end{cases}$

By applying the conditions (4), then

(6)  $c_j = \dfrac{1}{2n} \sum_{k=-n}^{n}{}' \, f(x_k) e^{-ijx_k}, \quad j = 0, \pm 1, \ldots, \pm n.$

It is now easy to check, by using (5), that the trigonometric sum (2) with
coefficients given by (6) satisfies the conditions (4); i.e., the required
interpolatory trigonometric sum is determined.

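If we define new coefficients $\alpha_j$ and $\beta_j$ by

$\alpha_j = c_j + c_{-j}, \quad \beta_j = i(c_j - c_{-j}), \quad j = 0, 1, 2, \ldots, n,$

then the sum (2) becomes, upon recalling (4b) and (1b),

(7)  $U_n(x) = \tfrac{1}{2}\alpha_0 + \sum_{j=1}^{n-1} \alpha_j \cos jx + \tfrac{1}{2}\alpha_n \cos nx + \sum_{j=1}^{n-1} \beta_j \sin jx.$

The factor $\tfrac{1}{2}$ on the term in $\cos nx$ comes from the primed sum in (2), which
halves the terms with $j = \pm n$; with (4b) these two terms combine to
$c_n \cos nx = \tfrac{1}{2}\alpha_n \cos nx$.

† If $j - m$ is an integral multiple of $2n$, we say that $j$ and $m$ are congruent modulo
$2n$, and we write $j \equiv m \pmod{2n}$. If not, we write $j \not\equiv m \pmod{2n}$.

From (6) it follows that these coefficients are real numbers given by

(8)  $\alpha_j = \dfrac{1}{n} \sum_{k=-n}^{n}{}' \, f(x_k) \cos jx_k, \quad j = 0, 1, \ldots, n,$

     $\beta_j = \dfrac{1}{n} \sum_{k=-n}^{n}{}' \, f(x_k) \sin jx_k, \quad j = 1, 2, \ldots, n - 1.$

Equations (7) and (8) are the real form for the trigonometric interpolation
sum. This form is suitable for computations without complex arithmetic.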
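
The real form can be tried out directly. The following is a minimal sketch in plain Python of the coefficient formulas (8) and of the evaluation of the interpolating sum. The function names are illustrative, and the term in $\cos nx$ is given weight one half, as follows from the primed sum in (2) together with (4b).

```python
import math

def trig_interp_coeffs(f, n):
    """Coefficients alpha_j, beta_j of (8) on the points x_k = k*pi/n, k = -n..n."""
    h = math.pi / n
    xs = [k * h for k in range(-n, n + 1)]
    # Primed sum: the two end terms k = -n and k = n carry weight 1/2.
    w = [0.5 if abs(k) == n else 1.0 for k in range(-n, n + 1)]
    fx = [f(x) for x in xs]
    alpha = [sum(wk * v * math.cos(j * x) for wk, v, x in zip(w, fx, xs)) / n
             for j in range(n + 1)]
    beta = [sum(wk * v * math.sin(j * x) for wk, v, x in zip(w, fx, xs)) / n
            for j in range(n + 1)]
    return xs, alpha, beta

def trig_interp_eval(alpha, beta, x):
    """Evaluate the real-form interpolating sum; beta[0] and beta[n] are unused."""
    n = len(alpha) - 1
    s = 0.5 * alpha[0] + 0.5 * alpha[n] * math.cos(n * x)
    for j in range(1, n):
        s += alpha[j] * math.cos(j * x) + beta[j] * math.sin(j * x)
    return s

f = lambda x: math.exp(math.sin(x))  # a smooth 2*pi-periodic test function
xs, alpha, beta = trig_interp_coeffs(f, 4)
residuals = [abs(trig_interp_eval(alpha, beta, x) - f(x)) for x in xs]
```

Under these assumptions every entry of `residuals` is at rounding level, i.e., the sum reproduces $f$ at all $2n + 1$ points of (3).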

It can be shown that the trigonometric sum (7) satisfying the conditions 
(4) is unique. This follows from Lemma 2 in Subsection 5.3. 

We may also determine unique interpolatory trigonometric sums of
order $n$ that take on specified values at $2n + 1$ distinct points arbitrarily
spaced in, say, $-\pi \le x \le \pi$ (not including both endpoints). The coefficients
of such a sum are the solutions of a non-singular linear system. The
non-singularity of this system and the interpolating trigonometric sum are
treated in Problems 1 and 2.

However, another trigonometric interpolation scheme for equally spaced
points can be based on the orthogonality property expressed in Lemma 1
of Subsection 3.5. That is, using $\theta$ as the independent variable, we consider
the $n + 1$ points,

(9)  $\theta_j = \theta_0 + j\,\Delta\theta, \quad j = 0, 1, \ldots, n; \quad \theta_0 = \dfrac{\Delta\theta}{2}, \quad \Delta\theta = \dfrac{\pi}{n + 1}.$

These $\theta_j$ are equally spaced in $[0, \pi]$, there may be an odd or even number
of them, and they do not include the endpoints [in contrast to those points
in (3)]. Now we seek a special trigonometric sum of order $n$ in the form

(10a)  $C_n(\theta) = \tfrac{1}{2}\gamma_0 + \sum_{r=1}^{n} \gamma_r \cos r\theta,$

such that for some function $g(\theta)$, continuous in $[0, \pi]$,

(11)  $C_n(\theta_j) = g(\theta_j), \quad j = 0, 1, \ldots, n.$

That is, we seek to interpolate $g(\theta)$ at the points (9) with a sum of the form
(10). Using the form (10) in (11) we multiply by $\cos s\theta_j$ and sum over $j$
to get by (3.62)

(10b)  $\gamma_s = \dfrac{2}{n + 1} \sum_{j=0}^{n} g(\theta_j) \cos s\theta_j, \quad s = 0, 1, \ldots, n.$

Thus, the interpolation problem is solved with the coefficients (10b) in the
trigonometric sum (10a). [Compare the formulae (8) and (10b).] We note
that the sum (10a) is an even function of $\theta$. Hence, if the same is true of
$g(\theta)$, then $C_n(\theta)$ is the interpolation sum at $2(n + 1)$ equally spaced points in
$[-\pi, \pi]$. Again, uniqueness of the nth order sum easily follows in this
case from Lemma 2 of Subsection 5.3. If $g(\theta)$ is not even, then the
approximation (10) should be used only over $[0, \pi]$.
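
The cosine interpolation scheme can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the points $\theta_j = (j + \tfrac{1}{2})\pi/(n + 1)$ of (9) and the coefficient formula (10b); the function names are not from the text.

```python
import math

def cosine_interp_coeffs(g, n):
    """Points (9) and coefficients gamma_s of (10b), s = 0..n."""
    dtheta = math.pi / (n + 1)
    thetas = [(j + 0.5) * dtheta for j in range(n + 1)]
    gs = [g(t) for t in thetas]
    gammas = [2.0 / (n + 1) * sum(v * math.cos(s * t) for v, t in zip(gs, thetas))
              for s in range(n + 1)]
    return thetas, gammas

def cosine_interp_eval(gammas, theta):
    """Evaluate the sum (10a) at theta."""
    return 0.5 * gammas[0] + sum(gammas[r] * math.cos(r * theta)
                                 for r in range(1, len(gammas)))

g = lambda t: math.exp(math.cos(t))  # an even, 2*pi-periodic test function
thetas, gammas = cosine_interp_coeffs(g, 6)
residuals = [abs(cosine_interp_eval(gammas, t) - g(t)) for t in thetas]
```

Under these assumptions `residuals` is at rounding level, which is the interpolation property (11). Since the sum is even in $\theta$, it also matches $g$ at the reflected points $-\theta_j$ when $g$ is even.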

An important convergence property of this approximation procedure is
contained in

THEOREM 2. Let $g(\theta)$ be an even function with period $2\pi$ and a continuous
second derivative on $[-\pi, \pi]$. Then the trigonometric interpolation sums
$C_n(\theta)$ given by (10a and b), which satisfy (11) on the equally spaced points
(9), converge uniformly as $n \to \infty$ to $g(\theta)$ on $[-\pi, \pi]$. In fact,

$|g(\theta) - C_n(\theta)| = O\!\left(\dfrac{1}{\sqrt{n}}\right)$, for all $|\theta| \le \pi$.


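Proof. We first estimate the rate of decay of the coefficients $\gamma_s$ for large
$s$. Clearly from (10b), since $|\cos s\theta| \le 1$,

(12)  $|\gamma_s| \le \dfrac{2}{n + 1} \sum_{j=0}^{n} |g(\theta_j)| \le 2 \max_{0 \le \theta \le \pi} |g(\theta)|.$

Such a bound holds for the coefficients in any sum of the form (10).

With the spacing $\Delta\theta = \pi/(n + 1)$ of (9), we define the function

(13)  $G(\theta) \equiv \left(\dfrac{1}{\Delta\theta}\right)^2 [g(\theta + \Delta\theta) - 2g(\theta) + g(\theta - \Delta\theta)].$

This function satisfies the same smoothness and periodicity conditions as
$g(\theta)$. If we set

(14)  $B_n(\theta) \equiv \left(\dfrac{1}{\Delta\theta}\right)^2 [C_n(\theta + \Delta\theta) - 2C_n(\theta) + C_n(\theta - \Delta\theta)],$

then from (11) and (13) it follows that

$B_n(\theta_j) = G(\theta_j), \quad j = 0, 1, \ldots, n,$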

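and so, $B_n(\theta)$ is the unique nth order trigonometric interpolation sum for
$G(\theta)$ with respect to the points $\theta_j$ in (9). If we use (10a) and the identities

$\cos (\phi + h) - 2 \cos\phi + \cos (\phi - h) = \cos\phi\,(2 \cos h - 2) = -4 \sin^2 \dfrac{h}{2}\, \cos\phi,$

in (14), we obtain

$B_n(\theta) = -\sum_{r=1}^{n} \gamma_r \left(\dfrac{2}{\Delta\theta}\right)^2 \sin^2\!\left(\dfrac{r\,\Delta\theta}{2}\right) \cos r\theta.$

Thus, $B_n(\theta)$ has the form (10a) and by the remark after (12) it now follows
that

(15)  $\left| \gamma_r \left(\dfrac{2}{\Delta\theta}\right)^2 \sin^2\!\left(\dfrac{r\,\Delta\theta}{2}\right) \right| \le 2 \max_{0 \le \theta \le \pi} |G(\theta)|, \quad r = 1, 2, \ldots, n.$

From Taylor's theorem

$g(\theta \pm \Delta\theta) = g(\theta) \pm \Delta\theta\, g'(\theta) + \dfrac{(\Delta\theta)^2}{2}\, g''(\theta \pm \phi_\pm \Delta\theta), \quad 0 < \phi_\pm < 1.$

By using this and the continuity of $g''(\theta)$ in (13), we find

$|G(\theta)| \le \max_{\phi} |g''(\phi)|, \quad \theta \in [0, \pi].$

This bound and the inequality (see Problem 5)

$\sin h \ge \dfrac{2}{\pi}\, h$ for $0 \le h \le \dfrac{\pi}{2}$

in (15) yield finally

(16)  $|\gamma_r| \le \dfrac{\pi^2}{2r^2} \max_{0 \le \theta \le \pi} |g''(\theta)| = O\!\left(\dfrac{1}{r^2}\right), \quad r = 1, 2, \ldots, n.$

This is the required estimate of the coefficients in (10).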

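In Subsection 5.2 we define Fourier series, and for $g(\theta)$ as above it
follows from Theorem 4 of that subsection that

(17)  $|g(\theta) - S_m(\theta)| = O\!\left(\dfrac{1}{m}\right), \quad a_m = O\!\left(\dfrac{1}{m^2}\right), \quad m = 1, 2, \ldots,$

where the partial sums, $S_m(\theta)$, and coefficients, $a_m$, are defined by

(18a)  $S_m(\theta) = \tfrac{1}{2} a_0 + \sum_{k=1}^{m} a_k \cos k\theta, \quad m = 1, 2, \ldots,$

(18b)  $a_k = \dfrac{1}{\pi} \int_{-\pi}^{\pi} g(\theta) \cos k\theta\, d\theta, \quad k = 0, 1, 2, \ldots.$

[The sine terms are absent since $g(\theta)$ is even, and hence the $b_k = 0$.] For
any $m \le n$, we define the truncated trigonometric interpolation sum

(19)  $C_{n,m}(\theta) \equiv \tfrac{1}{2}\gamma_0 + \sum_{r=1}^{m} \gamma_r \cos r\theta,$

where the $\gamma_r$ are given in (10b).
We now consider

(20)  $|g(\theta) - C_n(\theta)| = |g(\theta) - S_m(\theta) + S_m(\theta) - C_{n,m}(\theta) + C_{n,m}(\theta) - C_n(\theta)|$
      $\qquad \le |g(\theta) - S_m(\theta)| + |S_m(\theta) - C_{n,m}(\theta)| + |C_{n,m}(\theta) - C_n(\theta)|.$

From (19), (10), and the estimate (16) we have

(21)  $|C_{n,m}(\theta) - C_n(\theta)| = \left| \sum_{r=m+1}^{n} \gamma_r \cos r\theta \right| \le \sum_{r=m+1}^{n} |\gamma_r|$
      $\qquad \le \dfrac{\pi^2}{2} \max |g''(\theta)| \sum_{r=m+1}^{n} \dfrac{1}{r^2} = O\!\left(\dfrac{1}{m}\right).$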

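To estimate the middle term in (20) we note from (9), (10b), (18b), and
the evenness of $g(\theta)$ that

$a_k - \gamma_k = \dfrac{2}{\pi} \left[ \int_0^{\pi} g(\theta) \cos k\theta\, d\theta - \sum_{j=0}^{n} g(\theta_j) \cos k\theta_j\, \Delta\theta \right], \quad k = 0, 1, \ldots.$

This sum is clearly an approximation to the integral. It is, in fact, the
midpoint quadrature formula (see Chapter 7) and since the integrand has
a continuous second derivative it is easily shown that (see Problem 6)

$|a_k - \gamma_k| \le \dfrac{\pi^2}{12(n + 1)^2} \max_{0 \le \theta \le \pi} \left| \dfrac{d^2}{d\theta^2} [g(\theta) \cos k\theta] \right| = \begin{cases} O(k^2/n^2), & k \ge 1, \\ O(1/n^2), & k = 0. \end{cases}$

Using this estimate we have

(22)  $|S_m(\theta) - C_{n,m}(\theta)| = \left| \tfrac{1}{2}(a_0 - \gamma_0) + \sum_{k=1}^{m} (a_k - \gamma_k) \cos k\theta \right|$
      $\qquad \le O\!\left(\dfrac{1}{n^2}\right) + \sum_{k=1}^{m} O\!\left(\dfrac{k^2}{n^2}\right) = O\!\left(\dfrac{m^3}{n^2}\right).$

Thus (17), (21), and (22) in (20) imply

$|g(\theta) - C_n(\theta)| = O\!\left(\dfrac{1}{m} + \dfrac{m^3}{n^2}\right).$

Finally, we set $m = \sqrt{n}$. ■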

Several interesting corollaries easily follow. First, we have

COROLLARY 1. Under the hypothesis of Theorem 2 let $\sqrt{n} \le m \le n$ and
$C_{n,m}(\theta)$ be the mth order trigonometric sum defined by (19) and (10b).
Then,

(23)  $|g(\theta) - C_{n,m}(\theta)| = O\!\left(\dfrac{1}{\sqrt{n}}\right).$

Proof. It follows as in the theorem from (17), (22), and as in (21)
that $|C_{n,\sqrt{n}}(\theta) - C_{n,m}(\theta)| = O(1/\sqrt{n})$. ■

Thus various truncated trigonometric sums may furnish as good an
approximation as the entire nth order sum. Next, by changing variables
we obtain a result on the convergence of interpolation polynomials for
special unequally spaced points. We state this as

COROLLARY 2. Let $f(x)$ have a continuous second derivative on $[-1, 1]$.
If $P_n(x)$ is the interpolation polynomial of degree at most $n$ for $f(x)$, based
on the $n + 1$ points

(24)  $x_j = \cos\theta_j = \cos\left( \dfrac{2j + 1}{n + 1} \cdot \dfrac{\pi}{2} \right), \quad j = 0, 1, \ldots, n,$

then $P_n(x)$ converges to $f(x)$ on $[-1, 1]$ as $n \to \infty$. In fact,

$|f(x) - P_n(x)| = O\!\left(\dfrac{1}{\sqrt{n}}\right).$

Proof. We introduce the new variable $\theta$ in $[0, \pi]$ by

$\theta = \cos^{-1} x,$

and then define

$g(\theta) \equiv f(\cos\theta).$

We make $g(\theta)$ even, continuous, and $2\pi$ periodic by setting $g(-\theta) =
g(\theta)$. Thus, the points (24) become the points $\theta_j$ of (9) and $g(\theta)$ satisfies
the hypothesis of Theorem 2. Now the nth order interpolatory
trigonometric sum (10) for $g(\theta)$ becomes the interpolation polynomial of degree
at most $n$ in $x$ (represented in terms of the Chebyshev polynomials) upon
using the indicated variable change. So we have

$|f(x) - P_n(x)| = |g(\theta) - C_n(\theta)| = O\!\left(\dfrac{1}{\sqrt{n}}\right).$ ■

If we use the variable change and function $f(x)$, we find from Corollary 1
that for some polynomials, call them $P_{n,m}(x)$, of degree at most $m$, where
$\sqrt{n} \le m \le n$,

$|f(x) - P_{n,m}(x)| = O\!\left(\dfrac{1}{\sqrt{n}}\right).$

[Note that the $P_{n,m}(x)$ are not obtained by simply truncating the
interpolation polynomial, $P_n(x)$.] It is not difficult to show, however, that
$P_{n,m}(x)$ is the mth degree discrete least squares approximation to $f(x)$ with
respect to the $n + 1$ points (24); see Problem 7. Thus we have established
uniform convergence of the discrete least squares polynomials when the
$n + 1$ points are as in (24) and the degree is at least $\sqrt{n}$. Our present
estimates of convergence indicate that no improvement occurs by
interpolating at these $n + 1$ points with a polynomial of degree $n$. This is
another indication that high order interpolation polynomials should be
avoided.


5.2. Least Squares Trigonometric Approximation. Fourier Series 

If f(x ) is periodic of periodf 2tt and square integrable on [ — tt, tt], 
we can seek a trigonometric sum of form (la) for which 

(25) ||/ - S B || a = [f(x) - S n (x)] 2 dxj 2 

is a minimum with respect to all such sums. This norm now defines a 
quadratic function of In + 1 variables 

J(a 0 , a u ..a n , b u ..., b n ) = ||/ — 5 n || 2 2 

which can be minimized as was (3.8). The trigonometric functions satisfy 
the orthogonality relations 


(26) 


f cos jx cos kx dx 
J — n 

j sin jx sin kx dx 
J —n 

sin jx cos kx dx = 0. 

J — 71 


0 , 


j k, 
j = k 0, 
j * k, 
j = k ^ 0, 


By using these results in the normal system obtained by minimizing (25), 
we find in analogy with (3.11), 

1 if* 

(27) a k — - cos kx f(x) dx , b k — - sin kx f{x) dx. 

77 J ~ n 77 J -n 


† If the period of $f(x)$ is some number $p$, then the change of variable $\xi = 2\pi x/p$
results in a function $g(\xi) \equiv f(p\xi/2\pi)$ which has period $2\pi$.

The trigonometric sum (1a) with coefficients given in (27) determines the
best least squares approximation of order $n$ to $f(x)$ by such sums.
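
The formulas (27) can be tried out numerically. The following Python sketch is an illustration only: it assumes the form (1a) is $S_n(x) = \frac{1}{2} a_0 + \sum_{k=1}^{n} (a_k \cos kx + b_k \sin kx)$, as in (29), it approximates the integrals by a midpoint rule on a uniform grid, and the function names and the test function $f(x) = x^2$ are hypothetical choices that do not come from the text.

```python
import math

def fourier_coefficients(f, n, samples=2000):
    """Approximate a_0..a_n and b_1..b_n of (27) by the midpoint rule on [-pi, pi]."""
    h = 2.0 * math.pi / samples
    xs = [-math.pi + (i + 0.5) * h for i in range(samples)]
    fx = [f(x) for x in xs]
    a = [sum(math.cos(k * x) * y for x, y in zip(xs, fx)) * h / math.pi
         for k in range(n + 1)]
    b = [0.0] + [sum(math.sin(k * x) * y for x, y in zip(xs, fx)) * h / math.pi
                 for k in range(1, n + 1)]
    return a, b  # b[0] is a placeholder so that b[k] matches b_k

def trig_sum(a, b, x):
    """Evaluate S_n(x) = a_0/2 + sum_{k=1}^n (a_k cos kx + b_k sin kx)."""
    return 0.5 * a[0] + sum(a[k] * math.cos(k * x) + b[k] * math.sin(k * x)
                            for k in range(1, len(a)))

# Example: f(x) = x^2 on [-pi, pi]; the exact coefficients are
# a_0 = 2 pi^2 / 3, a_k = 4 (-1)^k / k^2 and b_k = 0.
a, b = fourier_coefficients(lambda x: x * x, 5)

# Left side of Bessel's inequality for n = 5 and its upper bound
# (1/pi) * integral of x^4 over [-pi, pi] = 2 pi^4 / 5.
bessel_lhs = 0.5 * a[0] ** 2 + sum(a[k] ** 2 + b[k] ** 2 for k in range(1, 6))
bessel_rhs = 2.0 * math.pi ** 4 / 5.0
```

For this even test function the computed $b_k$ vanish up to rounding, and `bessel_lhs` stays below `bessel_rhs`, the gap being the contribution of the omitted coefficients.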

We deduce as in Section 3 the corresponding Bessel's inequality [by
using (26) and (27) in $\|f - S_n\|_2^2 \ge 0$]:

(28)    $\frac{1}{2} a_0^2 + \sum_{k=1}^{n} (a_k^2 + b_k^2) \le \frac{1}{\pi} \int_{-\pi}^{\pi} f^2(x) \, dx.$

Since the right-hand side is independent of $n$, we also conclude that
$\frac{1}{2} a_0^2 + \sum_{k=1}^{\infty} (a_k^2 + b_k^2)$ converges and that

LEMMA 1.

    $\lim_{k \to \infty} a_k = \lim_{k \to \infty} b_k = 0.$  ■

The trigonometric sum (1a) is, of course, the $n$th partial sum of the
infinite series

(29)    $\frac{1}{2} a_0 + \sum_{k=1}^{\infty} (a_k \cos kx + b_k \sin kx).$

With coefficients given by (27), this is the Fourier series associated with the 
function f(x). 

We can now state 


THEOREM 3. Let $f(x)$ be continuous and $2\pi$ periodic. Then the partial sums
$S_n(x)$ of the Fourier series, with coefficients defined in (27), converge in the
mean to $f(x)$ and Parseval's equality holds. That is,

(30a)    $\lim_{n \to \infty} \int_{-\pi}^{\pi} [f(x) - S_n(x)]^2 \, dx = 0,$

(30b)    $\frac{a_0^2}{2} + \sum_{k=1}^{\infty} (a_k^2 + b_k^2) = \frac{1}{\pi} \int_{-\pi}^{\pi} [f(x)]^2 \, dx.$

Proof. Simply modify the proof of Theorem 3, Section 3. ■ 

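THEOREM 4. Let $f(x)$ have two continuous derivatives and be $2\pi$ periodic.
Then

    $|a_k| = \mathcal{O}\!\left(\frac{1}{k^2}\right), \qquad |b_k| = \mathcal{O}\!\left(\frac{1}{k^2}\right),$

and

    $|f(x) - S_n(x)| = \mathcal{O}\!\left(\frac{1}{n}\right)$  for  $-\pi \le x \le \pi$.

Proof. Let $a_k''$, $b_k''$ be the Fourier coefficients corresponding to the $2\pi$
periodic and continuous function $f''(x)$. Now, by repeated use of integration
by parts,

    $a_k'' = \frac{1}{\pi} \int_{-\pi}^{\pi} \cos kx \, f''(x) \, dx = \frac{k}{\pi} \int_{-\pi}^{\pi} \sin kx \, f'(x) \, dx$

    $\phantom{a_k''} = -\frac{k^2}{\pi} \int_{-\pi}^{\pi} \cos kx \, f(x) \, dx = -k^2 a_k$

and similarly $b_k'' = -k^2 b_k$.

But by Lemma 1, $b_k'' \to 0$ and $a_k'' \to 0$, whence

    $|a_k| = \mathcal{O}\!\left(\frac{1}{k^2}\right)$  and  $|b_k| = \mathcal{O}\!\left(\frac{1}{k^2}\right).$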

We therefore know that the Fourier series converges uniformly to a
continuous function $g(x)$. Hence, we may let $n \to \infty$ under the integral in
the statement (30a) of mean convergence, thus proving that
$f(x) = \lim_{n \to \infty} S_n(x)$. The error estimate $|f(x) - S_n(x)| = \mathcal{O}(1/n)$
follows from the boundedness of $\{\cos kx, \sin kx\}$ and the relation

    $\sum_{k=n+1}^{\infty} \frac{1}{k^2} < \int_n^{\infty} \frac{dx}{x^2} = \frac{1}{n}.$  ■


The theory of approximation by orthogonal functions owes much to J. 
Fourier, who employed trigonometric series of the form (29). In fact, 
least squares approximations of the form (3.7) with orthogonal polynomials
or other orthogonal sets of functions are generally called Fourier
series (assuming $n \to \infty$) and the coefficients given by (3.11), (3.21), or
(27) are called the Fourier coefficients. 

Finally, we observe a close connection between the trigonometric
interpolation coefficients (8) and the Fourier coefficients (27). Recalling
the definitions of $\sum'$ and the points $x_k$ in (3) we can write (8) as

    $\alpha_j = \frac{1}{\pi} \sum_{k=1-n}^{n} \left[ \frac{\cos jx_{k-1} \, f(x_{k-1}) + \cos jx_k \, f(x_k)}{2} \right] (x_k - x_{k-1}),$

    $\beta_j = \frac{1}{\pi} \sum_{k=1-n}^{n} \left[ \frac{\sin jx_{k-1} \, f(x_{k-1}) + \sin jx_k \, f(x_k)}{2} \right] (x_k - x_{k-1}).$

As $n \to \infty$ we have $x_k - x_{k-1} = \pi/n \to 0$ and [say for piecewise continuous
$f(x)$] these sums converge to the corresponding integrals in (27). Thus,
$(\alpha_j, \beta_j) \to (a_j, b_j)$ and the trigonometric interpolating sum (7) converges,
formally, to the Fourier series (29).

These sums correspond to the trapezoidal rule of numerical integration.
On the other hand, the coefficients $\gamma_j$ in (10b) for trigonometric
interpolation with respect to the points $\theta_j$ in (9) approximate the coefficients
by the midpoint rule for evaluating the integrals in (27). In the proof of
Theorem 2, it is shown that for fixed $j$: $|\gamma_j - a_j| = \mathcal{O}(1/n^2)$. [For the case
that the function $g(\theta)$ is not necessarily even, the corresponding
trigonometric interpolatory coefficients are defined in Problem 4, and similarly
converge to the Fourier coefficients.]
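
The trapezoidal character of the interpolation coefficients can be checked with a short computation. The Python sketch below is only an illustration: it assumes equally spaced points $x_k = k\pi/n$, $k = -n, \ldots, n$ (the definition (3) is not reproduced here), uses the test function $f(x) = x^2$, whose exact coefficient is $a_1 = -4$, and introduces the hypothetical name `trapezoid_cosine_coefficient`.

```python
import math

def trapezoid_cosine_coefficient(f, j, n):
    """Trapezoidal approximation of a_j = (1/pi) * integral of cos(jx) f(x) over [-pi, pi].

    Assumes the equally spaced points x_k = k*pi/n, k = -n, ..., n.
    """
    h = math.pi / n
    total = 0.0
    for k in range(1 - n, n + 1):
        left, right = (k - 1) * h, k * h
        # one trapezoid panel on [x_{k-1}, x_k]
        total += 0.5 * (math.cos(j * left) * f(left)
                        + math.cos(j * right) * f(right)) * h
    return total / math.pi

f = lambda x: x * x
exact_a1 = -4.0  # exact Fourier cosine coefficient a_1 of x^2 on [-pi, pi]
err_10 = abs(trapezoid_cosine_coefficient(f, 1, 10) - exact_a1)
err_20 = abs(trapezoid_cosine_coefficient(f, 1, 20) - exact_a1)
```

Doubling $n$ reduces the error by a factor close to four for this test function, which is the behaviour expected of a second order rule.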

5.3. “Best” Trigonometric Approximation 

If $f(x)$ is continuous on $[-\pi, \pi]$ we can seek a trigonometric sum (1),
of order $n$, which minimizes the maximum norm

    $\|f(x) - S_n(x)\|_\infty \equiv \max_{-\pi \le x \le \pi} |f(x) - S_n(x)|.$

The existence of such a best trigonometric approximation could be 
demonstrated by using an analogue of Theorem 0.1. Another proof is 
given in Problem 9. 

Results analogous to those in Theorem 4.2 and Theorem 4.3 are also 
valid for the best trigonometric approximation of order n. A careful 
glance at the proofs of these theorems reveals that the only property of 
polynomials employed is the fact that if a polynomial of degree n vanishes 
at n + 1 points, then it vanishes identically. Such a property is also true 
of trigonometric sums. In fact, best approximation by other sets of
functions is possible and the property they must possess to insure a unique
best approximation is called the Haar property, defined as follows:

A sequence of functions $\{f_0(x), f_1(x), \ldots\}$ has the Haar property if for
every $m$ the only linear combination

    $P_m(x) \equiv a_0 f_0(x) + a_1 f_1(x) + \cdots + a_m f_m(x)$

with $m + 1$ distinct zeros† is the identically vanishing combination
$P_m(x) \equiv 0$. It was proven by Haar that these conditions are necessary
and sufficient for uniqueness. However, we shall be concerned only with
the trigonometric case. Thus, we consider

LEMMA 2. The sequence of trigonometric functions

    $\{1, \cos x, \sin x, \cos 2x, \sin 2x, \ldots, \cos nx, \sin nx, \ldots\}$

has the Haar property.

Proof. We need only show that every non-trivial trigonometric sum,
(1), of order $n$ has at most $2n$ roots in $-\pi < x \le \pi$. Let us define $\xi = e^{ix}$
and note that $|\xi| = 1$. Then we have from (1c),

    $S_n(x) = \xi^{-n} \sum_{k=-n}^{n} c_k \xi^{n+k}.$

The sum on the right-hand side is a polynomial in $\xi$ of degree at most $2n$.
Thus, this sum has at most $2n$ roots if it has any non-zero coefficients. ■

† If the $f_k(x)$ are periodic, then the zeros must all lie in a period.

This lemma can be used to prove uniqueness of the trigonometric 
interpolation problems solved in Subsection 5.1 and Problem 2. We 
apply the lemma to prove the analog of Theorem 4.2, namely 

THEOREM 5. Let $f(x) \in C[-\pi, \pi]$ and let an $n$th order trigonometric sum,
$S_n(x)$, have the deviations from $f(x)$

    $f(x_j) - S_n(x_j) = (-1)^j e_j, \qquad j = 0, 1, \ldots, 2n + 1,$

where $-\pi < x_0 < x_1 < \cdots < x_{2n+1} \le \pi$ and all $e_j > 0$ or all $e_j < 0$.
Then

    $\min_j |e_j| \le \min_{(S_n)} \|f(x) - S_n(x)\|_\infty.$

Proof. Assume that for some trigonometric sum of order $n$, say $S_n^*(x)$,

    $\|f(x) - S_n^*(x)\|_\infty < \min_j |e_j|.$

Then, the $n$th order sum

    $S_n^*(x) - S_n(x) = [f(x) - S_n(x)] - [f(x) - S_n^*(x)]$

has the same sign at the points $x_j$ as does $f(x) - S_n(x)$. Thus there are
$2n + 1$ sign changes and at least $2n + 1$ zeros of this difference in $(-\pi, \pi)$.
But then, by the above lemma, this trigonometric sum vanishes identically
and so $S_n(x) \equiv S_n^*(x)$, which leads to a contradiction. ■

Continuing the analogy, we have in place of Theorem 4.3 

THEOREM 6. A trigonometric sum of order $n$, $S_n(x)$, is the best trigonometric
approximation of order $n$ to $f(x) \in C[-\pi, \pi]$ if and only if $f(x) - S_n(x)$
assumes the values $\pm \|f(x) - S_n(x)\|_\infty$, with alternate changes of sign, at
least $2n + 2$ times in $[-\pi, \pi]$. This best approximation is unique.

Proof. The proof is exactly analogous to that of Theorem 4.4. To
show sufficiency we employ Theorem 5. To demonstrate necessity we must
construct a trigonometric sum of order $k$, say, which has specified
zeros at distinct points $\xi_1, \ldots, \xi_{2k}$ in $(-\pi, \pi)$. This is done by forming
the determinant

(31)    $t(x) \equiv \begin{vmatrix} 1 & \cos x & \sin x & \cdots & \cos kx & \sin kx \\ 1 & \cos \xi_1 & \sin \xi_1 & \cdots & \cos k\xi_1 & \sin k\xi_1 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ 1 & \cos \xi_{2k} & \sin \xi_{2k} & \cdots & \cos k\xi_{2k} & \sin k\xi_{2k} \end{vmatrix}.$

The expression $t(x)$ is used in place of the polynomial $r(x)$ to obtain a
contradiction. [Note the relation of $t(x)$ to the determinant in Problem 1.]
The uniqueness follows by using the Haar property of the trigonometric
functions. The details are left to the reader. ■


PROBLEMS, SECTION 5 


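1. Let $-\pi < x_0 < x_1 < \cdots < x_{2n} \le \pi$, and define the determinant of
order $2n + 1$,

    $\Delta \equiv \begin{vmatrix} 1 & \cos x_0 & \sin x_0 & \cdots & \cos nx_0 & \sin nx_0 \\ 1 & \cos x_1 & \sin x_1 & \cdots & \cos nx_1 & \sin nx_1 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ 1 & \cos x_{2n} & \sin x_{2n} & \cdots & \cos nx_{2n} & \sin nx_{2n} \end{vmatrix}.$

Show that $\Delta \ne 0$ and in fact, that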

A = (— 1 )tl(n - l)/2 2 2" 2 fj |jj sin 

[Hint: Express sin 9 and cos 9 in exponential form, form linear combina¬ 
tions of successive pairs of columns so that only one exponential appears in 
each element, rearrange columns so that each row forms a geometric progres¬ 
sion. The result is a Vandermonde determinant.] 

2. Show that $S_n(x)$ is a trigonometric sum which satisfies $S_n(x_j) = f(x_j)$,
$j = 0, 1, \ldots, 2n$, where $\{x_k\}$ is given as in Problem 1 and

    $S_n(x) = \sum_{j=0}^{2n} f(x_j) \prod_{\substack{k=0 \\ (k \ne j)}}^{2n} \frac{\sin \frac{1}{2}(x - x_k)}{\sin \frac{1}{2}(x_j - x_k)}.$

This is the general interpolatory trigonometric sum of order $n$.

[Hint: Use Problem 3 and $\sin a \sin b = \frac{1}{2}[\cos(a - b) - \cos(a + b)]$.]

3. Verify that a trigonometric polynomial of degree at most $n$,

    $P_n(\cos\theta, \sin\theta) \equiv \sum_{i+j \le n} c_{i,j} (\cos\theta)^i (\sin\theta)^j,$

may be written as a trigonometric sum of order at most $n$,

    $P_n(\cos\theta, \sin\theta) \equiv S_n(\theta) = \sum_{k=0}^{n} a_k \cos k\theta + b_k \sin k\theta,$

and vice versa.

[Hint: Use (1b).]

4. Given $g(x)$ is $2\pi$ periodic, find the trigonometric sum

    $S_n(x) = \frac{a_0}{2} + \sum_{k=1}^{n-1} (a_k \cos kx + b_k \sin kx) + \frac{b_n}{2} \sin nx$

such that

    $S_n(X_j) = g(X_j)$  for  $X_j = \frac{(2j + 1)\pi}{2n}$,  $-n \le j \le n - 1$.

[Hint: Establish

    $\sum_{j=-n}^{n-1} \cos rX_j \cos sX_j = \begin{cases} 0, & r \ne s, \\ n, & n > r = s > 0, \\ 2n, & r = s = 0, \\ 0, & r = s = n; \end{cases}$

    $\sum_{j=-n}^{n-1} \cos rX_j \sin sX_j = 0;$

    $\sum_{j=-n}^{n-1} \sin rX_j \sin sX_j = \begin{cases} 0, & r \ne s, \\ n, & n > r = s > 0, \\ 0, & r = s = 0, \\ 2n, & r = s = n; \end{cases}$

whence

    $a_r = \frac{1}{n} \sum_{j=-n}^{n-1} g(X_j) \cos rX_j, \qquad b_r = \frac{1}{n} \sum_{j=-n}^{n-1} g(X_j) \sin rX_j.$]

5. Verify that

    $|\sin\theta| \ge \frac{2}{\pi} |\theta|$

provided $|\theta| \le \pi/2$.

[Hint: Consider the chord joining $(0, 0)$ and $(\pi/2, 1)$ on the graph of
$y = \sin x$.]

6. Let $f''(x)$ be continuous in $[a, b]$ and

    $x_j = x_0 + jh, \qquad x_0 = a + \frac{h}{2}, \qquad h = \frac{b - a}{n + 1}, \qquad j = 0, 1, \ldots, n.$

Show that

    $|E| \equiv \left| \int_a^b f(x) \, dx - \sum_{j=0}^{n} f(x_j) h \right| \le \frac{b - a}{24} h^2 \max_{a \le x \le b} |f''(x)|.$

[Hint: Write

    $E = \sum_{j=0}^{n} \int_{x_j - h/2}^{x_j + h/2} [f(x) - f(x_j)] \, dx$

and then use Taylor's theorem in each integrand to get

    $f(x) = f(x_j) + (x - x_j) f'(x_j) + \frac{(x - x_j)^2}{2} f''(x_j + \theta h), \qquad -1 < \theta < 1.$]
7. Show that the $m$th degree discrete least squares polynomial approximation
to $f(x)$ with respect to the $n + 1$ points (24) is obtained by truncating the
$n$th order trigonometric interpolation sum (10) for $g(\theta)$ on the points (9)
after the $m$th term and using the variable change $\theta = \cos^{-1} x$, $g(\theta) = f(\cos\theta)$.

[Hint: Simply change variables in the representation (3.52) by using the
discrete orthonormal polynomials (3.61). The uniqueness of the discrete
least squares polynomials is used.]

8. With the notation of equations (2), (3), (4), and (6), verify the discrete 
analogue of the Parseval equality , 


In 


n 


T l/(**)l 2 

- —n 


2n 


n 


2' |t/»(**)l 2 

— — n 



k,l 2 . 


[Hint: Use equation (5).] 

9. Prove the existence of a best trigonometric approximation in the form
(1) for the given function $f(x)$ on the interval $[-\pi, \pi]$.

[Hint: In Problem 2, if $|f(x)| \le M$ and $\{x_j\}$ are distinct, then the
trigonometric sum $S_n(x)$ has bounded coefficients. For fixed $n$, consider the
nonempty set, $C$, of trigonometric sums $\{S_n(x)\}$ such that

    $\|f(x) - S_n(x)\|_\infty \le \gamma.$

Note that by the above remark, the coefficients of all of the sums in $C$ are
bounded. Let $S_n^{\nu}(x)$, $\nu = 1, 2, \ldots$, be a minimizing sequence of trigonometric
sums in $C$, that is,

    $\|f - S_n^{\nu}\|_\infty \to \underset{C}{\mathrm{g.l.b.}} \, \|f - S_n\|_\infty.$

Pick a subsequence of $S_n^{\nu}$ such that their coefficients converge, i.e.,

    $a_k^{(\nu)} \to \hat{a}_k, \qquad b_k^{(\nu)} \to \hat{b}_k.$

The sum

    $\hat{S}_n(x) = \frac{\hat{a}_0}{2} + \sum_{k=1}^{n} \hat{a}_k \cos kx + \hat{b}_k \sin kx$

is a best approximation.]




6 


Differences, Interpolation 
Polynomials, and 
Approximate Differentiation 


0. INTRODUCTION 

Interpolation polynomials are of particular importance in numerical 
analysis, and so we devote special attention to them in this chapter. We 
are led naturally to the study of differences; both divided differences for 
arbitrarily spaced points and ordinary differences for equally spaced 
points. Not only can differences be neatly arranged in tables, for convenient 
hand computation (presumably an affair of the past), but they permit one 
to easily estimate the error in the approximation. Hence, methods based 
on differences are useful in this age of digital computers as they suggest 
very efficient computing techniques and can be used for checking the 
accuracy of a calculation. 

We examine the error in interpolation, when the polynomial passes 
through equally spaced points, in some detail. This error is generally 
much less near the center of the interval of interpolation points and grows 
rapidly outside this interval, i.e., for what is termed extrapolation.
Therefore, we construct special forms of the interpolation formulae which are
convenient for evaluation near the center of the interpolation interval.

We use interpolation polynomials to determine formulae for the numerical
approximation of derivatives of the interpolated function.


1. NEWTON'S INTERPOLATION POLYNOMIAL AND DIVIDED
DIFFERENCES


We have shown in Chapter 5, Section 2, that the interpolation polynomial
exists and is unique. Furthermore, for a fixed set of interpolation
points, it is easily constructed using the Lagrange interpolation coefficients.
The Lagrange representation has the defect that if another point of
interpolation were added, then the new higher degree interpolation polynomial
could not be obtained by easily modifying the previous one. (This is in
contrast, say, to Taylor's series expansion or to least squares expansion
in orthogonal functions where the next order approximation is obtained by
simply adjoining a term to the present approximation.) We seek then a
representation of the interpolation polynomial which has the property
that the next higher degree interpolation polynomial is found by simply
adding a new term.

Specifically, let $Q_k(x)$ be the interpolation polynomial for $f(x)$, of degree
at most $k$, with respect to the $k + 1$ distinct points $x_0, x_1, \ldots, x_k$. We seek
the successive interpolation polynomials, $\{Q_k(x)\}$, of degree at most $k$
in the form $Q_0(x) \equiv f(x_0)$ and

(1a)    $Q_k(x) = Q_{k-1}(x) + q_k(x)$,  for $k = 1, 2, \ldots, n$,

where $q_k(x)$ has at most degree $k$. Since we require

    $Q_k(x_j) = f(x_j) = Q_{k-1}(x_j), \qquad j = 0, 1, \ldots, k - 1,$

it follows that $q_k(x_j) = 0$ at these $k$ points. Thus, we may write

(1b)    $q_k(x) = a_k \prod_{j=0}^{k-1} (x - x_j),$

which represents the most general polynomial of degree at most $k$ that
vanishes at the indicated $k$ points. The constant $a_k$ remains to be
determined. But, in order that $Q_k(x_k) = f(x_k)$, it follows from (1a and b) that

(1c)    $a_k = \frac{f(x_k) - Q_{k-1}(x_k)}{\prod_{j=0}^{k-1} (x_k - x_j)}$,  for $k = 1, 2, \ldots, n$.

The zero degree interpolation polynomial for the initial point $x_0$ is,
trivially, $Q_0(x) \equiv f(x_0)$. Thus, with $a_0 = f(x_0)$, we obtain by using (1a
and b) recursively, for $k = 1, 2, \ldots, n$,

(2)    $Q_n(x) = a_0 + (x - x_0) a_1 + \cdots + (x - x_0) \cdots (x - x_{n-1}) a_n.$
The $k$th coefficient is called the $k$th order divided difference and is usually
expressed in the notation

(3)    $a_0 = f[x_0], \qquad a_k = f[x_0, x_1, \ldots, x_k], \quad k = 1, 2, \ldots.$

The values of $f(x)$ which enter into the determination of $a_k$ are those at
the arguments of $f[x_0, x_1, \ldots, x_k]$. We now obtain a representation for
these coefficients which is more explicit than the recursive form given in
(1). Since $Q_n(x)$ in (2) is the unique interpolation polynomial of degree $n$,
we may equate the leading coefficient, $a_n$, in this form with that obtained
by using the Lagrange form, see (2.5) in Chapter 5. That is, from

    $Q_n(x) = \sum_{j=0}^{n} f(x_j) \prod_{\substack{k=0 \\ (k \ne j)}}^{n} \frac{x - x_k}{x_j - x_k}$

the coefficient of $x^n$ is

(4)    $a_n = f[x_0, x_1, \ldots, x_n] = \sum_{j=0}^{n} \frac{f(x_j)}{\prod_{\substack{k=0 \\ (k \ne j)}}^{n} (x_j - x_k)}.$

This form could also be deduced directly from (1); see Problem 1.

From the representation (4) it follows that the divided differences are
symmetric functions of their arguments. That is, if we adopt the additional
notation

    $f_{i,j,k,\ldots} \equiv f[x_i, x_j, x_k, \ldots]$

then this symmetry is expressed by

(5)    $f_{0,1,\ldots,n} = f_{j_0,j_1,\ldots,j_n}$

where $(j_0, j_1, \ldots, j_n)$ is any permutation of the integers $(0, 1, \ldots, n)$.
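
The symmetry (5) is easy to observe numerically from the explicit sum (4). The short Python sketch below is an illustration under stated assumptions: the function name `divided_difference_sum`, the sample points and the choice $f = \cos$ are hypothetical, and the check holds only up to rounding error.

```python
import itertools
import math

def divided_difference_sum(xs, f):
    """f[x_0, ..., x_n] from the explicit representation (4); the x_j must be distinct."""
    total = 0.0
    for j, xj in enumerate(xs):
        denom = 1.0
        for k, xk in enumerate(xs):
            if k != j:
                denom *= xj - xk
        total += f(xj) / denom
    return total

points = [0.0, 0.3, 0.7, 1.2]
# Evaluate the same divided difference for every ordering of the arguments.
values = [divided_difference_sum(list(p), math.cos)
          for p in itertools.permutations(points)]
spread = max(values) - min(values)   # zero up to rounding, by the symmetry (5)
```

The sum (4) costs on the order of $n^2$ operations for a single difference and can suffer from cancellation, which is one reason a recursive form is preferred for actual computation.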

We may derive a more convenient form than (4) for computing the
divided differences by again making use of the uniqueness of the
interpolation polynomial. That is, we may construct the polynomial $Q_n(x)$ by
matching the values of $f(x_j)$ in the reverse order $j = n, n - 1, \ldots, 1, 0$. In
this way we would obtain, say,

(2')    $Q_n(x) \equiv b_0 + (x - x_n) b_1 + \cdots + (x - x_n)(x - x_{n-1}) \cdots (x - x_1) b_n$

where

    $b_k = f[x_n, x_{n-1}, \ldots, x_{n-k}]$  and  $b_0 = f[x_n] \equiv f(x_n).$

But $a_n = b_n$ since they are the coefficients of $x^n$ in the unique polynomial
$Q_n(x)$ of (2) and (2').
Now if we subtract equation (2') from equation (2) but display only the
terms which contribute to the coefficients of $x^n$ and $x^{n-1}$ we obtain, using
$a_n = b_n$,

    $0 = [(x - x_0) - (x - x_n)](x - x_1) \cdots (x - x_{n-1}) a_n + (a_{n-1} - b_{n-1}) x^{n-1} + p_{n-2}(x)$

where $p_{n-2}(x)$ is a polynomial of degree at most $n - 2$ in $x$. Since this
expression vanishes identically the coefficient of $x^{n-1}$ vanishes and hence
$0 = [x_n - x_0] a_n + (a_{n-1} - b_{n-1})$. Now the symmetry of the divided
differences, proven above, implies that

    $b_{n-1} = f[x_n, x_{n-1}, \ldots, x_1] = f[x_1, x_2, \ldots, x_n],$

whence from $a_n = (a_{n-1} - b_{n-1})/(x_0 - x_n)$, we have

(6)    $f[x_0, x_1, \ldots, x_n] = \frac{f[x_0, x_1, \ldots, x_{n-1}] - f[x_1, x_2, \ldots, x_n]}{x_0 - x_n}, \qquad n = 1, 2, \ldots.$

We leave it to the reader to verify (6) directly from (4) in Problem 9.
This recursion formula justifies the use of the name divided difference.
Of course, we then define for completeness

    $f[x_0] \equiv f(x_0).$

The interpolation polynomial (2) may now be written as

(7)    $Q_n(x) = f[x_0] + (x - x_0) f[x_0, x_1] + \cdots + (x - x_0) \cdots (x - x_{n-1}) f[x_0, x_1, \ldots, x_n].$

This form is known as Newton's divided difference interpolation formula. 
Note that to obtain the next higher degree such polynomial we need only 
add a term similar to the last term but involving a new divided difference 
of one higher order. 
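
As a computational illustration of (6) and (7), the following Python sketch builds the Newton coefficients in place and evaluates $Q_n(x)$ by nested multiplication. It is a minimal sketch, not the text's own algorithm: the function names, the sample cubic and the use of nested multiplication are choices made here for the example.

```python
def divided_differences(xs, fs):
    """Return [f[x_0], f[x_0,x_1], ..., f[x_0,...,x_n]] via the recursion (6)."""
    coef = list(fs)
    n = len(xs)
    for order in range(1, n):
        # Work from the bottom up so coef[i - 1] still holds the lower order value.
        for i in range(n - 1, order - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - order])
    return coef

def newton_eval(xs, coef, x):
    """Evaluate the Newton form (7) at x by nested multiplication."""
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[k]) + coef[k]
    return result

# Sample data from the cubic f(t) = t^3 - 2t + 1 (an arbitrary test function).
f = lambda t: t ** 3 - 2.0 * t + 1.0
xs = [0.0, 1.0, 2.0, 4.0]
coef = divided_differences(xs, [f(t) for t in xs])

# Coefficients for the first three points only: a prefix of the full list.
shorter = divided_differences(xs[:3], [f(t) for t in xs[:3]])
```

Because `shorter` coincides with the first three entries of `coef`, adding the point $x_3$ only appends one new coefficient, which is the property of the Newton form stressed above.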

In fact, let us set $k = n + 1$ in (1b and c) and (3) and then replace
$x_{n+1}$ by $x$. We obtain

(8)    $f(x) - Q_n(x) = \left[ \prod_{j=0}^{n} (x - x_j) \right] f[x_0, x_1, \ldots, x_n, x],$

which for $x$ distinct from $\{x_j\}$ defines the indicated divided difference.
On the other hand, this identity gives another representation of the error
in polynomial interpolation.

By means of formula (6) applied to $(x_j, x_{j+1}, \ldots, x_{j+n})$ we can construct
a table of divided differences in a symmetric manner based on

(6)'    $f_{j,j+1,\ldots,j+n} = \frac{f_{j+1,j+2,\ldots,j+n} - f_{j,j+1,\ldots,j+n-1}}{x_{j+n} - x_j}.$
Table 1   Divided Differences

x     f(x)   f[x, x]                           f[x, x, x]                             f[x, x, x, x]

x_0   f_0
             (f_1 - f_0)/(x_1 - x_0) = f_01
x_1   f_1                                      (f_12 - f_01)/(x_2 - x_0) = f_012
             (f_2 - f_1)/(x_2 - x_1) = f_12                                           (f_123 - f_012)/(x_3 - x_0) = f_0123
x_2   f_2                                      (f_23 - f_12)/(x_3 - x_1) = f_123
             (f_3 - f_2)/(x_3 - x_2) = f_23                                           (f_234 - f_123)/(x_4 - x_1) = f_1234
x_3   f_3                                      (f_34 - f_23)/(x_4 - x_2) = f_234
             (f_4 - f_3)/(x_4 - x_3) = f_34
x_4   f_4


See Table 1. The divided difference required to determine $Q_{n+1}(x)$ from
$Q_n(x)$ is easily obtained from Table 1 by just completing another "diagonal"
line of differences. This simple property is not shared by the Lagrange
form of the interpolation polynomial.
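
The column-by-column construction of Table 1 from (6)' can be sketched as follows. This Python fragment is illustrative only; the name `divided_difference_table` and the sample data from $f(x) = x^2$ are assumptions made for the example.

```python
def divided_difference_table(xs, fs):
    """Return columns of the divided difference table.

    table[m][j] holds f[x_j, x_{j+1}, ..., x_{j+m}], computed from (6)'.
    """
    n = len(xs)
    table = [list(fs)]
    for order in range(1, n):
        prev = table[-1]
        table.append([(prev[j + 1] - prev[j]) / (xs[j + order] - xs[j])
                      for j in range(n - order)])
    return table

# Sample data from f(x) = x^2 at unequally spaced points.
xs = [0.0, 1.0, 3.0, 4.0]
table = divided_difference_table(xs, [x * x for x in xs])

# The top entry of each column gives the Newton coefficients f[x_0, ..., x_m].
newton_coefficients = [column[0] for column in table]
```

For this quadratic the second differences are all equal to the leading coefficient and the third difference vanishes.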

Another representation of the divided differences, which is quite useful 
for estimating their magnitude as well as for many theoretical purposes, 
is contained in 

THEOREM 1. Let $x, x_0, x_1, \ldots, x_{k-1}$ be $k + 1$ distinct points and let $f(y)$
have a continuous derivative of order $k$ in the interval

    $\min(x, x_0, \ldots, x_{k-1}) < y < \max(x, x_0, \ldots, x_{k-1}).$

Then for some point $\xi = \xi(x)$ in this interval

(9)    $f[x_0, \ldots, x_{k-1}, x] = \frac{f^{(k)}(\xi)}{k!}.$

Proof. From equation (8) with $n$ replaced by $k - 1$ we write

    $f(x) - Q_{k-1}(x) = (x - x_0)(x - x_1) \cdots (x - x_{k-1}) f[x_0, \ldots, x_{k-1}, x].$

However, since $Q_{k-1}(x)$ is an interpolation polynomial which is equal to
$f(x)$ at the $k$ points $x_0, x_1, \ldots, x_{k-1}$ it follows from Theorem 2.1 of
Chapter 5 that

    $f(x) - Q_{k-1}(x) \equiv R_{k-1}(x) = (x - x_0)(x - x_1) \cdots (x - x_{k-1}) \frac{f^{(k)}(\xi)}{k!}.$

But by the hypothesis $(x - x_0)(x - x_1) \cdots (x - x_{k-1}) \ne 0$, and the
theorem follows by equating the above right-hand sides. ■
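
A quick numerical check of (9) is possible when $f^{(k)}$ is monotone, because the point $\xi$ can then be solved for. The sketch below uses $f(y) = e^y$, for which $f^{(k)} = e^y$; the nodes, the evaluation point and the function name are illustrative assumptions.

```python
import math

def divided_difference(xs, fs):
    """f[x_0, ..., x_n] from the recursion (6), for distinct points."""
    work = list(fs)
    n = len(xs)
    for order in range(1, n):
        for i in range(n - 1, order - 1, -1):
            work[i] = (work[i] - work[i - 1]) / (xs[i] - xs[i - order])
    return work[-1]

nodes = [0.0, 0.5, 1.0]      # x_0, ..., x_{k-1}
x = 0.75                     # the additional point
k = len(nodes)
points = nodes + [x]
dd = divided_difference(points, [math.exp(t) for t in points])

# Solve exp(xi) / k! = dd for xi; Theorem 1 asserts that xi lies in (0, 1).
xi = math.log(dd * math.factorial(k))
```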

A generalization, permitting coincident values, is established as
Corollary 2 of Theorem 2 that follows.

As an immediate consequence of Theorem 1, we can obtain some
information on the divided differences of polynomials. These results may
be stated as the

COROLLARY. Let

    $P_n(x) \equiv \alpha_0 + \alpha_1 x + \cdots + \alpha_n x^n, \qquad \alpha_n \ne 0,$

be any polynomial of degree $n$ and let $x_0, x_1, \ldots, x_k$ be any $k + 1$ distinct
points. Then

    $P_n[x_0, x_1, \ldots, x_k] = \begin{cases} \alpha_n & \text{if } k = n, \\ 0 & \text{if } k > n. \end{cases}$

Proof. The corollary follows from Theorem 1 since $d^n P_n(x)/dx^n = n! \, \alpha_n$,
and higher derivatives vanish. ■

We shall require some continuity and differentiability properties of 
divided differences for our later discussion of the error in numerical 
differentiation and integration. Most of these results can be derived from 
still another representation of the divided differences which we state as 

THEOREM 2. Let $f(x)$ have a continuous $n$th derivative in the interval
$\min(x_0, x_1, \ldots, x_n) \le x \le \max(x_0, x_1, \ldots, x_n)$. Then if the points $x_0$,
$x_1, \ldots, x_n$ are distinct,

$(10)_n$    $f[x_0, x_1, \ldots, x_n] = \int_0^1 dt_1 \int_0^{t_1} dt_2 \cdots \int_0^{t_{n-1}} dt_n \, f^{(n)}(t_n[x_n - x_{n-1}] + \cdots + t_1[x_1 - x_0] + x_0),$

where $n \ge 1$, $t_0 = 1$.

Proof. For an inductive proof, we first show that

    $f[x_0, x_1] = \int_0^1 dt_1 \, f'(t_1[x_1 - x_0] + x_0).$

Let a new variable of integration, $\xi$, be introduced by (since $x_1 \ne x_0$):

    $\xi = t_1[x_1 - x_0] + x_0, \qquad dt_1 = d\xi/[x_1 - x_0].$

The integration limits become

    $(t_1 = 0) \to (\xi = x_0); \qquad (t_1 = 1) \to (\xi = x_1).$

Therefore, we have

    $\int_0^1 dt_1 \, f'(t_1[x_1 - x_0] + x_0) = \frac{1}{x_1 - x_0} \int_{x_0}^{x_1} d\xi \, f'(\xi) = \frac{f(x_1) - f(x_0)}{x_1 - x_0}.$

Now we make the inductive hypothesis that

    $f[x_0, \ldots, x_{n-1}] = \int_0^1 dt_1 \int_0^{t_1} dt_2 \cdots \int_0^{t_{n-2}} dt_{n-1} \, f^{(n-1)}(t_{n-1}[x_{n-1} - x_{n-2}] + \cdots + t_1[x_1 - x_0] + x_0).$

In the integral in $(10)_n$ we replace the integration variable $t_n$ by

    $\xi = t_n[x_n - x_{n-1}] + \cdots + t_1[x_1 - x_0] + x_0, \qquad dt_n = d\xi/[x_n - x_{n-1}].$

The corresponding limits become

    $(t_n = 0) \to (\xi = \xi_0 \equiv t_{n-1}[x_{n-1} - x_{n-2}] + \cdots + t_1[x_1 - x_0] + x_0),$

    $(t_n = t_{n-1}) \to (\xi = \xi_1 \equiv t_{n-1}[x_n - x_{n-2}] + t_{n-2}[x_{n-2} - x_{n-3}] + \cdots + t_1[x_1 - x_0] + x_0).$

Now the innermost integral in $(10)_n$ is, since $x_n \ne x_{n-1}$,

    $\int_0^{t_{n-1}} dt_n \, f^{(n)}(t_n[x_n - x_{n-1}] + \cdots + t_1[x_1 - x_0] + x_0) = \frac{1}{x_n - x_{n-1}} \int_{\xi_0}^{\xi_1} d\xi \, f^{(n)}(\xi) = \frac{f^{(n-1)}(\xi_1) - f^{(n-1)}(\xi_0)}{x_n - x_{n-1}}.$

However, by applying the inductive hypothesis we have

    $\int_0^1 dt_1 \int_0^{t_1} dt_2 \cdots \int_0^{t_{n-2}} dt_{n-1} \, \frac{f^{(n-1)}(\xi_1) - f^{(n-1)}(\xi_0)}{x_n - x_{n-1}} = \frac{f[x_0, \ldots, x_{n-2}, x_n] - f[x_0, \ldots, x_{n-2}, x_{n-1}]}{x_n - x_{n-1}} = f[x_0, \ldots, x_n].$  ■
Notice that the integrand on the right-hand side of (10) is a continuous
function of the $n + 1$ variables $x_0, x_1, \ldots, x_n$, and hence the right-hand
side is a continuous function of these variables. Thus (10) defines uniquely
the continuous extension of $f[x_0, x_1, \ldots, x_n]$ when the arguments lie in
any interval of continuity of the $n$th derivative of $f(x)$. That is, since the
divided differences have only been defined for distinct sets of arguments
[see (1) and the discussion leading to it], we are at liberty to define them
when some of the arguments are not distinct. Naturally, we do this in a
manner which maintains, if possible, the above observed continuity. If
$f^{(n)}(x)$ is continuous, then Theorem 2 shows how this can be done for all
differences of $f(x)$ of orders $0, 1, \ldots, n$. These remarks can be summarized
as

COROLLARY 1. Let $f^{(n)}(x)$ be continuous in $[a, b]$. For any set of points
$x_0, x_1, \ldots, x_k$ in $[a, b]$ with $k \le n$ let $f[x_0, x_1, \ldots, x_k]$ be given by $(10)_k$.
The divided difference thus defined is a continuous function of its $k + 1$
arguments in $[a, b]$ and coincides with that defined by (4), or (6), when the
arguments are distinct. ■


In fact, as in the proof of the First Mean Value Theorem for integrals,
$(10)_n$ yields

    $m \int_0^1 dt_1 \int_0^{t_1} dt_2 \cdots \int_0^{t_{n-1}} dt_n \le f[x_0, \ldots, x_n] \le M \int_0^1 dt_1 \int_0^{t_1} dt_2 \cdots \int_0^{t_{n-1}} dt_n$

where $m = \min f^{(n)}(x)$ and $M = \max f^{(n)}(x)$ for $x$ in

    $\min(x_0, \ldots, x_n) \le x \le \max(x_0, \ldots, x_n).$

Since the iterated integral equals $1/n!$, the continuity of $f^{(n)}$ implies that
there is a point $\xi$ in this interval such that

    $f[x_0, \ldots, x_n] = \frac{f^{(n)}(\xi)}{n!}.$

Hence, we have established a generalization of Theorem 1, since the points
$x_i$ need not be distinct, in

COROLLARY 2. If $f^{(n)}(x)$ is continuous in $[a, b]$ and $x_0, x_1, \ldots, x_n$ are in
$[a, b]$, then

(11)    $f[x_0, x_1, \ldots, x_n] = \frac{f^{(n)}(\xi)}{n!},$

where

    $\min(x_0, \ldots, x_n) \le \xi \le \max(x_0, \ldots, x_n).$
A particular case is 
A particular case is

COROLLARY 3. If $f^{(n)}(x)$ is continuous in a neighborhood of $x$, then

(12)    $f[\underbrace{x, x, \ldots, x}_{n+1 \text{ terms}}] = \frac{f^{(n)}(x)}{n!}.$


Now, we can deduce yet another representation of the divided difference,
when several multiplicities† occur.

COROLLARY 4. If $f^{(n)}(x)$ is continuous in $[a, b]$, $y_0, y_1, \ldots, y_n$ are in $[a, b]$,
and $x$ is distinct from any $y_i$, then

(13)    $f[x, y_0, y_1, \ldots, y_n] = \frac{f[x, y_1, \ldots, y_n] - f[y_0, y_1, \ldots, y_n]}{x - y_0}$

gives the unique continuous extension of the definition of divided difference.

Proof. By Theorem 2, the right-hand side of (13) is uniquely defined
and continuous in $(y_0, y_1, \ldots, y_n)$ for $y_0 \ne x$. Hence, the left-hand side
is the unique continuous extension of the definition of divided difference
no matter what multiplicities occur in $(y_0, y_1, \ldots, y_n)$. Observe that only
the continuity of the $n$th derivative is used. ■
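
Corollary 3 and the recursion (13) together suggest a way to fill in a divided difference table when some arguments are repeated: coincident arguments use (12), all other entries use the usual recursion. The Python sketch below illustrates this under stated assumptions: equal nodes must be listed next to each other, the caller supplies the derivatives through a hypothetical `derivative(x, m)` callback, and $f = \sin$ with the nodes $1.6, 1.6, 1.7, 1.7$ is used purely as sample data.

```python
import math

def confluent_divided_differences(zs, derivative):
    """Newton coefficients f[z_0], f[z_0,z_1], ... when nodes may repeat.

    zs must list equal nodes next to each other; derivative(x, m) returns f^(m)(x).
    """
    n = len(zs)
    column = [derivative(z, 0) for z in zs]
    coef = [column[0]]
    for order in range(1, n):
        nxt = []
        for j in range(n - order):
            if zs[j + order] == zs[j]:
                # All arguments z_j, ..., z_{j+order} coincide: use (12).
                nxt.append(derivative(zs[j], order) / math.factorial(order))
            else:
                # End points differ: use the recursion, extended as in (13).
                nxt.append((column[j + 1] - column[j]) / (zs[j + order] - zs[j]))
        column = nxt
        coef.append(column[0])
    return coef

def newton_eval(zs, coef, x):
    """Evaluate the Newton form with nodes zs by nested multiplication."""
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (x - zs[k]) + coef[k]
    return result

# m-th derivative of sin at x is sin(x + m*pi/2).
sin_derivative = lambda x, m: math.sin(x + m * math.pi / 2)
zs = [1.6, 1.6, 1.7, 1.7]
coef = confluent_divided_differences(zs, sin_derivative)
approx = newton_eval(zs, coef, 1.65)
```

The resulting cubic matches $\sin x$ and its first derivative at both nodes, so `approx` agrees with $\sin(1.65)$ far more closely than linear interpolation on the same two nodes would.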


COROLLARY 5. If $x_i \ne y_j$, for $0 \le i \le p$, $0 \le j \le q$; $f^{(m)}(x)$ continuous
in $[a, b]$; $\{x_i\}$, $\{y_j\}$ in $[a, b]$; $0 \le p, q \le m$, then

(14)    $f[x_0, \ldots, x_p, y_0, \ldots, y_q] = g[x_0, \ldots, x_p] = h[y_0, \ldots, y_q]$

where

    $g(x) \equiv f[x, y_0, \ldots, y_q], \qquad h(y) \equiv f[x_0, \ldots, x_p, y],$

provides the unique continuous extension of the definition of divided difference.

Proof. By (13), $g(x)$ has $m$ continuous derivatives for $x \ne y_i$, $0 \le i \le q$.
Therefore, by the theorem, $g[x_0, \ldots, x_p]$ is defined and continuous in
$x_0, \ldots, x_p$ if $x_i \ne y_j$, as postulated. Furthermore, $g(x)$ is continuous as a
function of the parameters $(y_0, \ldots, y_q)$ if $x \ne y_j$, by Corollary 4. Hence,
the representation $(10)_p$ of $g[x_0, \ldots, x_p]$ yields the continuity of
$g[x_0, \ldots, x_p]$ with respect to all variables $(x_0, \ldots, x_p, y_0, \ldots, y_q)$ provided merely
that $x_i \ne y_j$.

Now the function $f[x_0, \ldots, x_p, y_0, \ldots, y_q]$ as defined in (14) is
continuous (if $x_i \ne y_j$) in its variables; hence we conclude that (14) is the
unique continuous extension, since (14) is valid when the arguments are
all distinct. ■


† The conclusions of Corollaries 4-7 that follow concern continuity properties of and
representations for divided differences that are easily established when no
multiplicities occur among the arguments. When multiplicities do occur, the corollaries
establish the same representations under the hypothesis of minimal differentiability of
$f(x)$.

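COROLLARY 6. If $f(x)$ has a continuous derivative of order $m$ in $[a, b]$;
$x_0, \ldots, x_p, y_0, \ldots, y_q, z_0, \ldots, z_r$ are in $[a, b]$; $x_i \ne y_j$, $x_i \ne z_k$, $y_j \ne z_k$
for all $i, j, k$; $0 \le p, q, r \le m$; then

(15)    $f[x_0, \ldots, x_p, y_0, \ldots, y_q, z_0, \ldots, z_r] = \frac{1}{p! \, q! \, r!} \left. \frac{\partial^p}{\partial x^p} \frac{\partial^q}{\partial y^q} \frac{\partial^r}{\partial z^r} f[x, y, z] \right|_{x = \xi, \; y = \eta, \; z = \zeta}$

where

    $\min(x_0, \ldots, x_p) \le \xi \le \max(x_0, \ldots, x_p),$
    $\min(y_0, \ldots, y_q) \le \eta \le \max(y_0, \ldots, y_q),$
    $\min(z_0, \ldots, z_r) \le \zeta \le \max(z_0, \ldots, z_r).$

Proof. Let

        $g(x) \equiv f[x, y_0, \ldots, y_q, z_0, \ldots, z_r]$
(16)    $h(y) \equiv f[x, y, z_0, \ldots, z_r]$
        $k(z) \equiv f[x, y, z].$

By Corollaries 2 and 5 [appropriately generalized for sets of variables
$(\{x_i\}, \{y_j\}, \{z_k\})$], we have

(17)    $f[x_0, \ldots, x_p, y_0, \ldots, y_q, z_0, \ldots, z_r] = g[x_0, \ldots, x_p]$

(18)    $g[x_0, \ldots, x_p] = \frac{1}{p!} \left. \frac{\partial^p}{\partial x^p} g(x) \right|_{x = \xi}$

(19)    $g(x) = h[y_0, \ldots, y_q] = \frac{1}{q!} \left. \frac{\partial^q}{\partial y^q} h(y) \right|_{y = \eta}$

(20)    $h(y) = k[z_0, \ldots, z_r] = \frac{1}{r!} \left. \frac{\partial^r}{\partial z^r} k(z) \right|_{z = \zeta}.$

The conclusion (15) follows from (17), (18), (19), and (20). ■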

A special case is contained in 


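COROLLARY 7. If $f^{(m)}(x)$ is continuous in $[a, b]$; $x$, $y$, $z$ are distinct points
in $[a, b]$; $0 \le p, q, r \le m$; then

(21)    $f[\underbrace{x, \ldots, x}_{p+1}, \underbrace{y, \ldots, y}_{q+1}, \underbrace{z, \ldots, z}_{r+1}] = \frac{1}{p! \, q! \, r!} \frac{\partial^p}{\partial x^p} \frac{\partial^q}{\partial y^q} \frac{\partial^r}{\partial z^r} f[x, y, z].$  ■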


We leave to Problems 3, 4, 5, 6, and 7, the independent proof of some 
simple differentiability properties of divided differences, which are needed 
later. 

PROBLEMS, SECTION 1

1. Deduce the representation (4) for divided differences directly from the
expression (1).

[Hint: Use induction and the form (2) for $Q_{k+1}(x_k)$.]

2. If $P_n(x)$ is a polynomial of degree $n$, show that $P_n[x_0, x]$ for $x \ne x_0$ is a
polynomial of degree at most $n - 1$ in $x$.

3. Prove the following

LEMMA 1. If $x, x_0, \ldots, x_n$ are $n + 2$ distinct points then

    $f[x_0, \ldots, x_n, x] = \sum_{j=0}^{n} \frac{f[x, x_j]}{\prod_{\substack{k=0 \\ (k \ne j)}}^{n} (x_j - x_k)}.$

[Hint: Use equation (8) and the Lagrange form of the interpolation
polynomial.] This is another representation of the divided differences which is
very useful in deriving their continuity and differentiability properties.

4. Prove directly the following:

THEOREM 3. If $f(x) \in C[a, b]$ and $f'(x_j)$ exists for some fixed $x_j \in [a, b]$,
then $f[x, x_j]$ is a continuous function of $x$ in $[a, b]$ if we assign to it the value at
$x = x_j$: $f[x_j, x_j] = f'(x_j)$.

5. Prove the following:

THEOREM 4. Let $f'(x) \in C[a, b]$ and $f''(x)$ be continuous in an (arbitrarily
small) interval about some fixed $x_j \in [a, b]$. Then $df[x, x_j]/dx$ is a continuous
function of $x$ in $[a, b]$.

[Hint: Form $df[x, x_j]/dx$ for $x \ne x_j$, use the Taylor expansion about $x_j$
and take limits as $x = x_j \pm h \to x_j$.]

6. Use the results of Problems 3, 4, and 5 to state and prove, if $(x_0, x_1, \ldots,
x_n)$ are distinct and in $[a, b]$,

(i) a theorem on the continuity of $f[x_0, \ldots, x_n, x]$ for $x \in [a, b]$;

(ii) a theorem on the continuity of $(d/dx) f[x_0, \ldots, x_n, x]$ for $x \in [a, b]$.

7. By using the theorem under (i) of Problem 6, note that
$f[x_0, \ldots, x_n, x_n] = \lim_{h \to 0} f[x_0, \ldots, x_n, x_n + h]$. Therefore, show that

    $f[x_0, \ldots, x_n, x_n] = \frac{\partial}{\partial x_n} f[x_0, \ldots, x_n].$

Prove that this representation is valid under the conditions: $f'(x_n)$ is defined
and $x_0, x_1, \ldots, x_n$ are distinct.

[Hint: Use the lemma of Problem 3 and the formula (6).]

8. Prove the symmetry of the divided difference by constructing the $Q_n(x)$
in (2) using the points $(x_0, x_1, \ldots, x_n)$ in an arbitrary permuted order
$(x_{j_0}, x_{j_1}, \ldots, x_{j_n})$. [This is a generalization of what was done in deriving (2')
and proving $a_n = b_n$.]

9. Verify equation (6) (the divided difference property) directly from equation 
(4). 

10.* (Osculatory interpolation). If $f(x)$ and its derivatives of order $r_0 - 1$,
$r_1 - 1, \ldots, r_n - 1$ are defined respectively at the distinct points $(x_0, x_1, \ldots, x_n)$
in $[a, b]$, then there exists a unique polynomial $Q_N(x)$ of degree at most $N$,
where $N = \sum_{k=0}^{n} r_k - 1$, such that $Q_N^{(k)}(x_j) = f^{(k)}(x_j)$ for $k = 0, 1, \ldots, r_j - 1$,
$j = 0, 1, \ldots, n$. The special case $r_0 = r_1 = \cdots = r_n = 2$ has been studied
in Chapter 5, Subsection 2.2.

[Hint: Show that the Newton form of the polynomial may be derived by
satisfying successively all of the $r_0$ conditions at $x_0$ before proceeding to satisfy
successively the $r_1$ conditions at $x_1$, etc.

Arrive at the scheme

    $Q_0(x) = f(x_0)$

    $Q_s(x) = Q_{s-1}(x) + b_s (x - x_0)^s$,  for $1 \le s < s_0$ where $s_0 = r_0$

    $Q_s(x) = Q_{s-1}(x) + b_s \prod_{k=0}^{j-1} (x - x_k)^{r_k} (x - x_j)^{s - s_{j-1}},$

for $s_{j-1} \le s < s_j$ with $j = 1, 2, \ldots, n$ where $s_j = \sum_{k=0}^{j} r_k$.

Show that the $b_s$ may be recursively defined. The proof of uniqueness may
be based on the fact that if a polynomial has degree at most $N$ and at least
$N + 1$ zeros (counting multiplicities), then the polynomial is identically
zero.]

11.* If $f^{(p)}(x)$ is continuous in $[a, b]$, $0 \le r_i - 1 \le p$, $\{x_i\}$ in $[a, b]$, then 
show that the coefficients $b_s$ of Problem 10 are divided differences of $f(x)$ of 
order $s$, based on the first $s + 1$ arguments in the sequence $x_0, x_0, \ldots, x_0$; 
$x_1, x_1, \ldots, x_1$; $\ldots$, where each $x_i$ appears $r_i$ times (the divided differences have been 
defined in Theorem 2, Corollary 5). 

12.* Verify that the error in osculatory interpolation (see Problem 10) is 

    $R_N(x) = f(x) - Q_N(x) = \frac{\prod_{i=0}^{n} (x - x_i)^{r_i}}{(N + 1)!} f^{(N+1)}(\xi)$ 

if $f$ has $N + 1$ continuous derivatives in $[a, b]$, where $\xi$ is in $[a, b]$. 

13. Given the values 

sin (1.6) = .9995736030 cos (1.6) = - .0291995223 

sin (1.7) = .9916648105 cos (1.7) = - .1288444943 

approximate sin (1.65) to seven decimal places by evaluating Taylor’s series 
(about x = 1.6) including the third derivative term. Estimate the error by 
examining the remainder term in the formula. Calculate sin (1.65) correct to 
9 decimal places and verify that the above estimate of error is correct. 

14. Use the table in Problem 13 and calculate sin (1.65) by linear interpolation. 
Verify that the magnitude of the error is consistent with the remainder 
term as given by (8) and (9), or equivalently by equation (2.9) of Chapter 5. 

15. Construct a table of divided differences from the values given in Problem 
13 for the function sin x with the repeated arguments (1.6) (1.6) (1.7) (1.7); 
and find the Newton form of the osculating polynomial of degree 3. Calculate 
sin (1.65) by evaluating the osculating polynomial, and verify that the magnitude 
of the error is explained by the formula in Problem 12. 

Table for Problem 15 

x        f(x)         f[x, x]                    f[x, x, x]    f[x, x, x, x] 

(1.6)    sin (1.6) 
                      cos (1.6) 
(1.6)    sin (1.6) 
                      10 (sin 1.7 - sin 1.6) 
(1.7)    sin (1.7) 
                      cos (1.7) 
(1.7)    sin (1.7) 


16. Given the repeated arguments $x_0, x_0, x_0, x_1, x_1$ and the values $f^{(p)}(x_j)$ 
in the accompanying divided difference table. Complete the table and verify 
that the fourth order difference has the value given by (21) if $x_1 = x_0 + h$. 

Table for Problem 16 

x        f[x]     f[x, x]     f[x, x, x]               f[x, x, x, x]    f[x, x, x, x, x] 

$x_0$    $f_0$ 
                  $f_0'$ 
$x_0$    $f_0$                $f_0''/2$ 
                  $f_0'$ 
$x_0$    $f_0$                $(f_{01} - f_0')/h$ 
                  $f_{01}$ 
$x_1$    $f_1$                $(f_1' - f_{01})/h$ 
                  $f_1'$ 
$x_1$    $f_1$ 



17. Given the $m + n + p + 3$ points $(x_0, x_1, \ldots, x_m; y_0, y_1, \ldots, y_n; z_0, 
z_1, \ldots, z_p)$ and $f(x)$ which has derivatives of order $(m, n, p)$ respectively, 
in a neighborhood of each of the distinct points $(x, y, z)$. Show that 

    $f[x_0, \ldots, z_p] = \frac{g^{(m)}(x)}{m!} + \frac{h^{(n)}(y)}{n!} + \frac{k^{(p)}(z)}{p!}$ 

if $x_i = x$, $y_j = y$, $z_r = z$ for all $i, j, r$, where 

    $g(x) = \frac{f(x)}{\prod_{j=0}^{n} (x - y_j) \prod_{r=0}^{p} (x - z_r)},$ 

    $h(y) = \frac{f(y)}{\prod_{i=0}^{m} (y - x_i) \prod_{r=0}^{p} (y - z_r)},$ 

    $k(z) = \frac{f(z)}{\prod_{i=0}^{m} (z - x_i) \prod_{j=0}^{n} (z - y_j)}.$ 

[Hint: $f[x_0, x_1, \ldots, z_{p-1}, z_p] = g[x_0, x_1, \ldots, x_m] + h[y_0, y_1, \ldots, y_n] + 
k[z_0, z_1, \ldots, z_p]$ for distinct points $(x_0, \ldots, z_p)$. Therefore, 

    $f[x_0, \ldots, x_m, y_0, \ldots, y_n, z_0, \ldots, z_p] = \frac{g^{(m)}(\xi)}{m!} + \frac{h^{(n)}(\eta)}{n!} + \frac{k^{(p)}(\zeta)}{p!}.$ 

Now let $x_i \to x$, $y_j \to y$, $z_r \to z$, all $i, j, r$.] 


2. ITERATIVE LINEAR INTERPOLATION 

The Newton form of the interpolation polynomial permits one to increase 
easily the accuracy (actually the degree) of the approximating 
polynomial. It has many important applications and is indeed well suited 
for computations when the data are available in the appropriate tabular 
form. However, it can be viewed as one of a class of methods for generating 
successively higher order interpolation polynomials which we consider 
briefly. These procedures are iterative and can be very effectively employed 
on modern digital computers since they are based on successive linear 
interpolations. 

The lemma underlying the iterative linear interpolation schemes can be 
stated as 

LEMMA 1. Let $x_{i_1}, x_{i_2}, \ldots, x_{i_n}$ be $n$ distinct points and denote by 
$P_{i_1, i_2, \ldots, i_n}(x)$ the interpolation polynomial of degree $n - 1$ such that 

    $P_{i_1, i_2, \ldots, i_n}(x_{i_\nu}) = f(x_{i_\nu}), \qquad \nu = 1, 2, \ldots, n.$ 

Then if $x_j$, $x_k$, and $x_{i_\nu}$, $\nu = 1, 2, \ldots, m$, are any $m + 2$ distinct points 

(1)    $P_{i_1, i_2, \ldots, i_m, j, k}(x) = \frac{(x - x_k) P_{i_1, i_2, \ldots, i_m, j}(x) - (x - x_j) P_{i_1, i_2, \ldots, i_m, k}(x)}{x_j - x_k}$ 

for $m = 0, 1, 2, \ldots$. 

Proof. We establish (1) by observing that the right-hand side defines 
a polynomial of degree at most $m + 1$ which takes on the values $f(x_{i_\nu})$ 
at $x_{i_\nu}$ for $\nu = 1, \ldots, m$, $f(x_j)$ at $x_j$ and $f(x_k)$ at $x_k$. Hence, the polynomial 
on the right-hand side of (1) is the unique interpolation polynomial which 
appears on the left-hand side of (1). ■ 

The various schemes which employ Lemma 1 to determine successively 
higher order interpolation polynomials differ in the order in which the 
pairs of values $(x_j, f(x_j))$ are used. For many applications, particularly 
on digital computers, the function values are generated sequentially and 
it may not be known in advance how many values are to be generated. 

Table 1 

$x_0$    $f(x_0)$ 
$x_1$    $f(x_1)$    $P_{0,1}(x)$ 
$x_2$    $f(x_2)$    $P_{1,2}(x)$      $P_{0,1,2}(x)$ 
  .         .           . 
$x_k$    $f(x_k)$    $P_{k-1,k}(x)$    $P_{k-2,k-1,k}(x)$    ...    $P_{0,1,\ldots,k}(x)$ 


For such cases we may always employ the latest pair of values and compute 
each row sequentially to find the array in Table 1. Any $P_{\ldots}(x)$ in this scheme 
is obtained by using the two quantities to its left and diagonally above. 
Thus, to determine, say, the $(k + 1)$st row, only the $k$th row need be 
retained. Of course, as more points are generated, rows of greater length 
must be saved. If it is known in advance that a fixed number, say $k + 1$, 
of function values is to be generated, then a different order of computing 
is appropriate. That is, we may compute by columns and when a particular 
column has been evaluated the preceding column may be discarded. 
The schemes based on Table 1 are known as Neville's iterated interpolation. 
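
The row-by-row computation just described can be sketched in Python as follows. This is only an illustrative sketch: the function name `neville` and its argument layout are assumptions made here, and the sample data are taken from the table for Problem 1 of this section.

```python
import math

def neville(xs, fs, x):
    """Value at x of the polynomial interpolating (xs[i], fs[i]), by rows of Table 1."""
    prev_row = []
    for k in range(len(xs)):
        # Row k holds f(x_k), P_{k-1,k}(x), ..., P_{0,1,...,k}(x)
        row = [fs[k]]
        for m in range(1, k + 1):
            left = row[m - 1]        # entry to the left:      P_{k-m+1,...,k}(x)
            above = prev_row[m - 1]  # entry diagonally above: P_{k-m,...,k-1}(x)
            # Lemma 1 applied to the pair of points xs[k-m] and xs[k]
            row.append(((x - xs[k - m]) * left - (x - xs[k]) * above)
                       / (xs[k] - xs[k - m]))
        prev_row = row               # only the latest row need be retained
    return prev_row[-1]

# Data from the table for Problem 1 of this section
xs = [1.5, 1.6, 1.7, 1.8]
fs = [0.9974949866, 0.9995736030, 0.9916648105, 0.9738476309]
approx = neville(xs, fs, 1.65)
print(approx, math.sin(1.65))
```

Only `prev_row` is kept between rows, which mirrors the remark that the $k$th row suffices to form the $(k + 1)$st.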

Another sequence of interpolants is used in Aitken's iterative interpolation, 
as is indicated in Table 2. Again, computation by columns is 
appropriate for a known fixed value of $k$. 

Table 2 

$x_0$    $f(x_0)$ 
$x_1$    $f(x_1)$    $P_{0,1}(x)$ 
$x_2$    $f(x_2)$    $P_{0,2}(x)$    $P_{0,1,2}(x)$ 
  .         .           . 
$x_k$    $f(x_k)$    $P_{0,k}(x)$    $P_{0,1,k}(x)$    ...    $P_{0,1,\ldots,k}(x)$ 

Note that the $(k + 1)$st row 
can be computed if we save only the “diagonal” elements $f(x_0), P_{0,1}(x), \ldots, 
P_{0,1,\ldots,k}(x)$. In brief, the basic difference between these two procedures 
is that in Aitken’s, the interpolants on the row with $x_k$ use points with 
subscripts nearest 0, while in Neville’s they use points with subscripts 
nearest $k$, as we read the entries from left to right. 
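
A minimal Python sketch of Aitken's scheme, saving only the diagonal elements as noted above, might read as follows. The name `aitken` and the list-based interface are assumptions made for illustration.

```python
import math

def aitken(xs, fs, x):
    """Value at x of the interpolation polynomial, by rows of Table 2."""
    diag = []  # saved diagonal: f(x_0), P_{0,1}(x), ..., P_{0,1,...,k-1}(x)
    for k in range(len(xs)):
        entry = fs[k]  # first entry of row k: f(x_k)
        for m in range(k):
            # entry is P_{0,...,m-1,k}(x) and diag[m] is P_{0,...,m}(x);
            # Lemma 1 with the pair xs[m], xs[k] gives P_{0,...,m,k}(x)
            entry = ((x - xs[m]) * entry - (x - xs[k]) * diag[m]) / (xs[k] - xs[m])
        diag.append(entry)  # last entry of row k: P_{0,1,...,k}(x)
    return diag[-1]

xs = [1.5, 1.6, 1.7, 1.8]
fs = [0.9974949866, 0.9995736030, 0.9916648105, 0.9738476309]
approx = aitken(xs, fs, 1.65)
print(approx, math.sin(1.65))
```

Since both schemes end with the same interpolation polynomial $P_{0,1,\ldots,k}(x)$, the final value should agree with Neville's scheme up to rounding.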

A particularly important application of Neville’s method is to what is 
called iterative inverse interpolation. Given $y = f(x)$ we define the inverse 
function, $x = g(y)$, such that $y = f(g(y))$ and $x = g(f(x))$. Then it is 
desired to find a particular value of $x$, say $x = \bar{x}$, such that $f(\bar{x}) = \bar{y}$. 
Thus, we need only find the value of $g(\bar{y})$. Given pairs of values $(x_i, f(x_i)) = 
(g(y_i), y_i)$, interpolation may be used to approximate $g(\bar{y})$. As additional 
pairs $(x_j, f(x_j))$ become available, better or rather higher order interpolation 
can be obtained by using Neville’s scheme. Now, however, in 
Table 1, the first two columns are interchanged and the argument of the 
polynomials is $f$ (or $y$). The computation should be done by rows. After 
several steps it is advisable to use only a few of the row elements and to 
discard those rows involving values $x_i$ which were “early” iterates and 
thus presumably not sufficiently close to the desired value, $\bar{x}$. It should 
be noted that this procedure is essentially one of the generalizations of the 
method of false position or of Aitken’s $\delta^2$-method (see Chapter 3, Subsections 
2.3 and 2.4). When the evaluation of the function $f(x)$ is not a 
difficult task, most workers prefer to use only a single linear interpolation 
at each stage, where the values $[x_k, f(x_k)]$ and $[x_{k+1}, f(x_{k+1})]$ are the most 
recent, i.e., regula falsi. Of course, inverse interpolation, in general, is 
meaningful only if $f^{-1}(x)$ is defined as a single-valued function over the 
interval in question. 
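
As a hedged illustration of inverse interpolation, the sketch below interchanges the roles of the two columns and evaluates the interpolant of $g$ at the target value. The example equation $e^x = 2$, the starting abscissae, and the names `neville` and `inverse_interpolate` are assumptions chosen here; for brevity a fixed set of pairs is used and the array is computed by columns rather than by rows.

```python
import math

def neville(xs, fs, x):
    """Neville's scheme computed by columns, overwriting one working column."""
    p = list(fs)
    n = len(xs)
    for m in range(1, n):
        for i in range(n - m):
            # p[i] becomes the interpolant through the points i, ..., i+m
            p[i] = ((x - xs[i]) * p[i + 1] - (x - xs[i + m]) * p[i]) / (xs[i + m] - xs[i])
    return p[0]

def inverse_interpolate(xs, ys, y_bar):
    """Approximate g(y_bar), where x = g(y) inverts y = f(x) on the given data."""
    # The first two columns of Table 1 are interchanged: the argument is now y.
    return neville(ys, xs, y_bar)

xs = [0.5, 0.6, 0.7, 0.8]
ys = [math.exp(x) for x in xs]          # f is single-valued and monotone here
x_bar = inverse_interpolate(xs, ys, 2.0)
print(x_bar, math.log(2.0))
```

The example relies on $f$ being monotone on the interval, in line with the closing remark that $f^{-1}$ must be single-valued there.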


PROBLEMS, SECTION 2 

1. Given the accompanying table for sin (x), interpolate for sin (1.65) by 
determining the value $P_3(1.65)$ of the Lagrange interpolation polynomial 
with the use of both the Neville and the Aitken schemes of successive linear 
interpolations. (Make up tables corresponding to Tables 1 and 2.) 

Table for Problem 1 


sin (1.5) = .9974949866 
sin (1.6) = .9995736030 
sin (1.7) = .9916648105 
sin (1.8) = .9738476309 


2. Evaluate sin (1.65) by finding the Newton form of $P_3(1.65)$. Compare 
the amount of work in Problems 1 and 2, when done by hand. 


3. FORWARD DIFFERENCES AND EQUALLY SPACED INTERPOLATION POINTS 

Many of the results in Section 1 are simplified and additional important 
consequences are obtained if the points of interpolation are equally 
spaced. Very many, if not most, of the practical applications are with 







A relation between divided differences with equally spaced arguments 
and forward differences is easily obtained from the representation (1.6) 
and the definition (3). Thus by taking $x$ to be any one of the points $x_j$ 
of (1), say $x_0$, we have 

    $\Delta f(x_0) = f(x_1) - f(x_0) = (x_1 - x_0) \frac{f(x_1) - f(x_0)}{x_1 - x_0} = h f[x_0, x_1].$ 

Now to proceed by induction on $n$, the order of the difference, we assume 
that 

(4)    $\Delta^n f(x_0) = n! \, h^n f[x_0, x_1, \ldots, x_n].$ 

Then 

    $\Delta^{n+1} f(x_0) = \Delta^n f(x_1) - \Delta^n f(x_0)$ 

    $= n! \, h^n f[x_1, x_2, \ldots, x_{n+1}] - n! \, h^n f[x_0, x_1, \ldots, x_n]$ 

    $= n! \, h^n (x_{n+1} - x_0) \frac{f[x_1, x_2, \ldots, x_{n+1}] - f[x_0, x_1, \ldots, x_n]}{x_{n+1} - x_0}$ 

    $= (n + 1)! \, h^{n+1} f[x_0, x_1, \ldots, x_{n+1}].$ 

Thus (4) is established for all $n \ge 1$. 
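
Relation (4) is easy to check numerically. The following Python sketch forms the leading forward differences and the leading divided differences of a table independently and compares them; the helper names and the choice of $e^x$ with $h = 0.5$ are assumptions for illustration only.

```python
import math
from math import factorial

def forward_differences(values):
    """Return [f_0, Δf_0, Δ²f_0, ..., Δⁿf_0] for equally spaced values."""
    out, col = [], list(values)
    while col:
        out.append(col[0])
        col = [b - a for a, b in zip(col, col[1:])]
    return out

def divided_differences(xs, values):
    """Return [f[x_0], f[x_0,x_1], ..., f[x_0,...,x_n]]."""
    out, col, order = [], list(values), 0
    while col:
        out.append(col[0])
        order += 1
        col = [(col[i + 1] - col[i]) / (xs[i + order] - xs[i])
               for i in range(len(col) - 1)]
    return out

h = 0.5
xs = [1.0 + j * h for j in range(5)]
values = [math.exp(x) for x in xs]
fd = forward_differences(values)
dd = divided_differences(xs, values)
for n in range(len(xs)):
    # (4): Δⁿ f(x_0) = n! hⁿ f[x_0, ..., x_n]
    print(n, fd[n], factorial(n) * h**n * dd[n])
```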

Another representation of the forward differences can now be obtained 
by specializing the representation (1.4) to equally spaced points. We note 
that, by (1) 

    $\prod_{j=0,\, j \ne i}^{n} (x_i - x_j) = \prod_{j=0,\, j \ne i}^{n} (i - j) h$ 

    $= h^n \prod_{j=0}^{i-1} (i - j) \prod_{j=i+1}^{n} (i - j)$ 

    $= (-1)^{n-i} h^n \, i! \, (n - i)!.$ 

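By using this result in (1.4) we obtain 

(5)    $f[x_0, x_1, \ldots, x_n] = \frac{1}{n! \, h^n} \sum_{i=0}^{n} (-1)^{n-i} \binom{n}{i} f(x_i),$ 

where 

    $\binom{n}{i} = \frac{n!}{i! \, (n - i)!}$ 

are the usual binomial coefficients and $0! = 1$. From (5) in (4) there results 

(6)    $\Delta^n f(x_0) = \sum_{i=0}^{n} (-1)^{n-i} \binom{n}{i} f(x_i).$ 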

A final expression for the forward differences is obtained by using (4) 
in (1.9) with x = x k to get 

(7)    $\Delta^n f(x_0) = h^n f^{(n)}(\xi); \qquad x_0 < \xi < x_n.$ 


Of course, (7) is valid assuming that f(x) has an nth derivative in the 
indicated interval. 

It is clear, from (7), that the $n$th forward difference of a polynomial of 
degree $n$ is a constant and that higher order differences vanish. More 
generally, if $f(x)$ has all derivatives bounded, say $|f^{(n)}(x)| \le M$ for all $n$, 
then (7) implies that 

    $|\Delta^n f| \le M h^n.$ 

Thus if $h < 1$ the magnitude of $n$th order differences of $f(x)$ will in general 
decrease as $n$ increases. On the other hand, if the $n$th derivative of $f(x)$ 
grows with $n$, the $n$th difference will decrease only if $h$ is “sufficiently 
small.” This may be illustrated by the function $f(x) = e^{\alpha x}$ where $\alpha > 0$. 
Clearly, $f^{(n)}(x) = \alpha^n e^{\alpha x}$ and using (7) $\Delta^n f(x_0) = h^n \alpha^n e^{\alpha \xi}$. If the interpolation 
points are to be confined to the interval $x_0 \le x \le x_0 + L$, then 
$0 \le nh \le L$ and the differences will generally decrease only if $h < 1/\alpha$ 
(we here neglect the variation in $e^{\alpha \xi}$ which may occur). Finally, if $f^{(n)}(x)$ 
is not bounded for all $n$ we can expect $|\Delta^n f|$ to decrease, for sufficiently 
small $h$, only for the first several values of $n$. This heuristic observation is 
the basis of a method for detecting isolated errors in forward difference 
tables of supposedly smooth functions. 

To describe this method we first observe that 


    $\Delta^n [f(x) + g(x)] = \Delta^n f(x) + \Delta^n g(x).$ 


Now suppose that $f(x)$ is a smooth function, say for simplicity with all 
derivatives bounded, and that in tabulating this function an error of 
amount $\delta$ is made in the single entry $f(x_j)$. Thus the function actually 
tabulated can be written as $f(x) + g(x)$ where 

    $g(x_i) = 0$ for $i \ne j$; $\qquad g(x_i) = \delta$ for $i = j$. 

Applying the representation (6) we see that the column of $n$th differences 
of $g$ will contain quantities of the form 

    $(-1)^{n-k} \binom{n}{k} \delta, \qquad k = 0, 1, \ldots, n.$ 

Thus in examining the higher differences of the tabulated data $f(x) + g(x)$, 
since $\Delta^n f$ decreases with $n$, an error will be apparent if the entries, from 
some column on, alternate in sign and the magnitudes of these varying 
entries are proportional to the appropriate binomial coefficients. The 
power of this method is illustrated in Problem 2. 
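
The following Python sketch illustrates the method on synthetic data. The tabulated function ($\sin x$), the spacing, the position of the error, and its size are all assumptions chosen for the demonstration; the helper name `nth_differences` is likewise hypothetical.

```python
from math import comb, sin

def nth_differences(values, n):
    """Column of nth forward differences of a list of tabulated values."""
    col = list(values)
    for _ in range(n):
        col = [b - a for a, b in zip(col, col[1:])]
    return col

h, n, j, delta = 0.1, 4, 6, 1e-3
clean = [sin(1.0 + i * h) for i in range(13)]
dirty = list(clean)
dirty[j] += delta                      # isolated error in the single entry f(x_j)

d_clean = nth_differences(clean, n)
d_dirty = nth_differences(dirty, n)
# By linearity, the contamination is the column of nth differences of g
pattern = [a - b for a, b in zip(d_dirty, d_clean)]
for i, (d, p) in enumerate(zip(d_dirty, pattern)):
    print(i, f"{d: .6f}", f"{p / delta: .1f}")
```

The last printed column should show the alternating binomial pattern in units of $\delta$, while the clean fourth differences are only of order $h^4$ and are swamped by it.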

This procedure will not be practical if the isolated error $\delta$ is of the same 
order of magnitude as the roundoff errors generally present in all of the 
tabular data. In fact, we shall see that if roundoff errors are present, 
differences of a sufficiently high order may have no significance. Let the 
tabular entries be $f(x) + \rho(x)$, where $\rho(x)$ is the rounding error. 

Let $\epsilon$ be a bound on the rounding error, i.e., $|\rho(x)| \le \epsilon$. The worst 
possibility for the distribution of these errors is 

(8)    $\rho(x_j) = (-1)^j \epsilon.$ 

Table 2 

$\rho(x)$        $\Delta$          $\Delta^2$        $\Delta^3$        $\Delta^4$ 

$\epsilon$ 
                 $-2\epsilon$ 
$-\epsilon$                        $4\epsilon$ 
                 $2\epsilon$                         $-8\epsilon$ 
$\epsilon$                         $-4\epsilon$                        $16\epsilon$ 
                 $-2\epsilon$                        $8\epsilon$ 
$-\epsilon$                        $4\epsilon$ 
                 $2\epsilon$ 
$\epsilon$ 


This is made clear by the table of differences for such a distribution 
(Table 2). Any other distribution leads to some entries which would be 
less in absolute value than the corresponding entries above. From Table 2 
we see that 

    $|\Delta^n \rho(x_j)| = 2^n \epsilon.$ 

[This result may be easily proved for $j = 0$ by using (8) in (6).] Thus the 
roundoff error present in the $n$th difference is at most $2^n \epsilon$. If there are $s$ 
decimals retained with a roundoff error of at most one-half unit in the 
last place then $\epsilon = \frac{1}{2} 10^{-s}$, and the bound on the roundoff error in the 
$n$th difference becomes $2^{n-1} 10^{-s}$. 
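
The growth factor $2^n$ can be confirmed exactly with rational arithmetic. In this small Python sketch the number of retained decimals, $s = 5$, and the table length are arbitrary assumptions, and `nth_differences` is a hypothetical helper.

```python
from fractions import Fraction

def nth_differences(values, n):
    """Column of nth forward differences of a list of values."""
    col = list(values)
    for _ in range(n):
        col = [b - a for a, b in zip(col, col[1:])]
    return col

s = 5
eps = Fraction(1, 2) / 10**s                      # one-half unit in the last place
worst = [(-1) ** j * eps for j in range(9)]       # distribution (8)
same_sign = [eps] * 9                             # a milder distribution

for n in range(5):
    # every entry of the nth difference column has magnitude 2**n * eps
    print(n, {abs(d) for d in nth_differences(worst, n)}, 2**n * eps)
```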

3.1. Interpolation Polynomials and Remainder Terms for Equidistant 
Points 

The Lagrange and Newton forms of the interpolation polynomials 
become simplified when the interpolation points are equally spaced. To 
be consistent with the notation (1) we introduce a new independent 
variable, $t$, by setting 

(9)    $x = x_0 + th.$ 

Thus, $t$ measures $x - x_0$ in units of $h$ and is an integer only at the points 
$x_j$ of (1). 

Now the Lagrange interpolation coefficients, (2.6) of Chapter 5, can 
be written as 

    $\phi_{n,j}(x) = \phi_{n,j}(x_0 + th) = \prod_{k=0,\, k \ne j}^{n} \frac{t - k}{j - k}.$ 

It is convenient to introduce 

(10)   $\pi_0(t) = t,$ 

       $\pi_n(t) = t(t - 1) \cdots (t - n), \qquad n = 1, 2, \ldots.$ 


The function $\pi_n(t)$ is a polynomial in $t$ of degree $n + 1$ and is frequently 
called the $(n + 1)$st factorial polynomial. In terms of this polynomial, 
the Lagrange interpolation coefficients become 

    $\phi_{n,j}(x_0 + th) = \frac{\pi_n(t)}{n!} \binom{n}{j} \frac{(-1)^{n-j}}{t - j}.$ 

Using this formula in (2.5) of Chapter 5, the Lagrange form of the interpolation 
polynomial simplifies to 

(11)   $P_n(x_0 + th) = \frac{\pi_n(t)}{n!} \sum_{j=0}^{n} (-1)^{n-j} \binom{n}{j} \frac{f(x_j)}{t - j}.$ 

Newton’s form of the interpolation polynomial is simplified by using 
(4), (9), and (10) in (1.7) to get 

(12)   $Q_n(x_0 + th) = f(x_0) + \pi_0(t) \Delta f(x_0) + \frac{\pi_1(t)}{2!} \Delta^2 f(x_0) + \cdots + \frac{\pi_{n-1}(t)}{n!} \Delta^n f(x_0).$ 
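
A short Python sketch of the Newton forward form (12) follows. It builds the coefficients $\pi_{k-1}(t)/k!$ recursively alongside the leading differences $\Delta^k f(x_0)$; the function name `newton_forward` and the test data are assumptions made for illustration.

```python
import math

def newton_forward(f_values, t):
    """Evaluate (12) at x_0 + t*h from the equally spaced values f(x_0), ..., f(x_n)."""
    col = list(f_values)
    total = 0.0
    coeff = 1.0          # pi_{k-1}(t) / k!, taken as 1 for k = 0
    k = 0
    while col:
        total += coeff * col[0]                       # col[0] is Δ^k f(x_0)
        coeff *= (t - k) / (k + 1)                    # pi_k(t) / (k+1)!
        col = [b - a for a, b in zip(col, col[1:])]   # next difference column
        k += 1
    return total

# sin(1.65) from the table sin(1.5), ..., sin(1.8): x_0 = 1.5, h = 0.1, t = 1.5
fs = [0.9974949866, 0.9995736030, 0.9916648105, 0.9738476309]
approx = newton_forward(fs, 1.5)
print(approx, math.sin(1.65))
```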


The error in these interpolation polynomials is, by (1.8), (9), (10), and 
Theorem 1.1, 

(13)   $R_n(x) = R_n(x_0 + th) = \pi_n(t) h^{n+1} f[x_0, \ldots, x_n, x] = \pi_n(t) h^{n+1} \frac{f^{(n+1)}(\xi)}{(n + 1)!}.$ 

$R_n(x)$ may also be called the remainder for the interpolation formula. Of 
course, the final form is valid only if $f(x)$ has $n + 1$ derivatives in the 
interval containing $x$, $x_0$ and $x_0 + nh$. To obtain a clear idea of the behavior 
of this error, we shall study some properties of the factorial 
polynomials $\pi_n(t)$. 

From the definition (10) it is clear that $\pi_n(t)$ has $n + 1$ real roots and 
they are at $t = 0, 1, \ldots, n$. These polynomials have certain symmetries 
about the point $t = n/2$ which is the midpoint of the zeros of $\pi_n(t)$. 


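LEMMA 1. For $n$ odd, 

    $\pi_n\left(\frac{n}{2} - \tau\right) = \pi_n\left(\frac{n}{2} + \tau\right)$ 

[i.e., $\pi_n(t)$ is symmetric about $t = n/2$]; for $n$ even, 

    $\pi_n\left(\frac{n}{2} - \tau\right) = -\pi_n\left(\frac{n}{2} + \tau\right)$ 

[i.e., $\pi_n(t)$ is anti-symmetric about $t = n/2$]. 

Proof. Clearly, $\pi_n(n/2 - \tau)$ and $\pi_n(n/2 + \tau)$ are polynomials of degree 
$n + 1$ in $\tau$ and have the same $n + 1$ roots: 

    $\tau = -\frac{n}{2}, \; -\frac{n}{2} + 1, \; \ldots, \; \frac{n}{2}.$ 

Thus, these polynomials differ by at most a constant factor. Comparing 
coefficients of the leading terms in each then yields 

    $\pi_n\left(\frac{n}{2} - \tau\right) = (-1)^{n+1} \pi_n\left(\frac{n}{2} + \tau\right).$ ■ 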
A further result which contains a comparison of the magnitudes of 
$\pi_n(t)$ at various points is contained in 

LEMMA 2. (a) Let $t + 1$ be any non-integral point in $0 < t + 1 \le n/2$. 
Then 

    $|\pi_n(t + 1)| < |\pi_n(t)|.$ 

(b) Let $t$ be any non-integral point in $n/2 \le t < n$. Then 

    $|\pi_n(t)| < |\pi_n(t + 1)|.$ 

Proof. Since, in part (a), $t + 1$ is non-integral, $t < n$ is also non-integral 
and we may form 

    $\frac{\pi_n(t + 1)}{\pi_n(t)} = \frac{(t + 1)(t)(t - 1) \cdots (t - n + 1)}{(t)(t - 1) \cdots (t - n + 1)(t - n)} = \frac{t + 1}{t - n},$ 

so that 

    $\left| \frac{\pi_n(t + 1)}{\pi_n(t)} \right| = \frac{t + 1}{n - t} = \frac{t + 1}{(n + 1) - (t + 1)} \le \frac{n/2}{(n + 1) - (n/2)} = \frac{1}{1 + 2/n} < 1.$ 

Thus, part (a) is proved. Part (b) follows from part (a) by using the symmetry 
properties of Lemma 1. ■ 

The properties of $\pi_n(t)$ proven in Lemmas 1 and 2 are illustrated in 
Figure 1 where $\pi_n(t)$ is plotted for $n = 5$ and $n = 6$. It easily follows from 
Lemma 2 that the maximum of $|\pi_n(t)|$ in $[0, n]$ occurs in the interval 
$(0, 1)$, or equivalently in $(n - 1, n)$. A lower bound on this magnitude is 
furnished by 

    $\max_{0 \le t \le n} |\pi_n(t)| > |\pi_n(\tfrac{1}{2})| = \prod_{j=0}^{n} |\tfrac{1}{2} - j| = \frac{(2n)!}{2^{2n+1} \, n!}.$ 

Using this bound we may compare the quantity $h^{n+1} \pi_n(t)$ for equally 
spaced interpolation points with the corresponding error factor, 
$\prod_{j=0}^{n} (x - x_j)$, for the Chebyshev interpolation points (i.e., with that 
factor which deviates least from zero determined in Chapter 5, Subsection 
4.2). If the interpolation is to be employed over the interval $[a, b]$, then for 
the Chebyshev points we have by (4.17) of Chapter 5 

    $M_{Ch} \equiv \max_{a \le x \le b} \left| \prod_{j=0}^{n} (x - x_j) \right| = \frac{1}{2^n} \left( \frac{b - a}{2} \right)^{n+1};$ 

and this value is attained at least $n + 2$ times in the interval. For equal 
spacing in $[a, b]$, we have $h = (b - a)/n$ and so 

    $M_{Eq} \equiv \max_{0 \le t \le n} |h^{n+1} \pi_n(t)| > \left( \frac{b - a}{n} \right)^{n+1} \frac{(2n)!}{2^{2n+1} \, n!}.$ 

Thus, using Stirling’s formula we find that for $n$ large 

    $\frac{M_{Ch}}{M_{Eq}} < \frac{n}{\sqrt{2}} \left( \frac{e}{4} \right)^n = \frac{n}{\sqrt{2}} (0.6796\ldots)^n.$ 

So if interpolation is to be employed over the entire range $[a, b]$, we find 
that the ratio of the maximum error factors essentially decreases, at least 
exponentially, for large $n$. Thus, the Chebyshev fit is better in the above 
comparison. However, we may only want to employ the interpolation 
polynomial near the center of the interval $[a, b]$. Specifically for $n$ odd, 
say that $n = 2m + 1$, and that we are interested in the error over 

    $\left[ \frac{a + b}{2} - \frac{h}{2}, \; \frac{a + b}{2} + \frac{h}{2} \right],$ 

i.e., an interval of length $h$ centered in $[a, b]$. Now we find that 

    $M_{Eq}^{*} \equiv \max_{m \le t \le m+1} |h^{n+1} \pi_n(t)| = h^{n+1} |\pi_n(m + \tfrac{1}{2})| = \left( \frac{b - a}{n} \right)^{n+1} \left( \frac{n!}{2^n \, m!} \right)^2.$ 

Figure 1a. Error factor for interpolation with 6 equidistant points. 

Hence it follows that for large $n = 2m + 1$, 

    $\frac{M_{Ch}}{M_{Eq}^{*}} \cong \frac{1}{2} \left( \frac{e}{2} \right)^n = \frac{1}{2} (1.3591\ldots)^n.$ 

We thus find that for interpolation near the center of the interval of interpolation 
points the error factor for equally spaced points is exponentially 
smaller than the maximum for the Chebyshev fit over the entire interval. 
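
The two asymptotic comparisons can be checked numerically. The Python sketch below evaluates the exact expressions for $M_{Ch}$, the lower bound on $M_{Eq}$, and $M_{Eq}^{*}$ for an odd $n$, taking $b - a = 1$ (the interval length cancels in the ratios). The function name and the normalization are assumptions made here, and the first ratio is formed with the lower bound on $M_{Eq}$, so it is an upper bound on $M_{Ch}/M_{Eq}$.

```python
from math import e, factorial, sqrt

def error_factor_ratios(n):
    """Return (M_Ch / lower bound on M_Eq, M_Ch / M_Eq*) for odd n = 2m + 1, b - a = 1."""
    m = (n - 1) // 2
    length = 1.0
    m_ch = 2.0 ** (-n) * (length / 2) ** (n + 1)
    m_eq_lower = (length / n) ** (n + 1) * (factorial(2 * n) / (2 ** (2 * n + 1) * factorial(n)))
    m_eq_center = (length / n) ** (n + 1) * (factorial(n) / (2 ** n * factorial(m))) ** 2
    return m_ch / m_eq_lower, m_ch / m_eq_center

for n in (11, 21, 41):
    whole, center = error_factor_ratios(n)
    # compare with the Stirling estimates (n / sqrt 2)(e/4)^n and (1/2)(e/2)^n
    print(n, whole, n / sqrt(2) * (e / 4) ** n, center, 0.5 * (e / 2) ** n)
```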
Figure 1b. Error factor for interpolation with 7 equidistant points. 


It should also be observed (see Figure 1) that the error factor $|\pi_n(t)|$ 
grows rapidly for $t < 0$ and $t > n$. Thus if the “equally spaced” interpolation 
polynomial is used to extrapolate a function (i.e., to estimate 
values outside the interval of the interpolation points) we may expect the 
error to be much larger, in general, than it is for interpolation. Of course, 
the same is also true of the extrapolation error using the Chebyshev 
points. In fact, for any distribution of interpolation points, the growth of 
the magnitude of the error factor can be bounded, outside the interval of 
interpolation points, by the error factor with the Chebyshev points. 
More precisely, for any choice of interpolation points $x_0, x_1, \ldots, x_n$ in 
$[-1, 1]$ the error factor is, say, $\prod_{j=0}^{n} (x - x_j) \equiv p_{n+1}(x)$. If the Chebyshev 
points are used, then 

    $\prod_{j=0}^{n} (x - x_j) = T_{n+1}(x).$ 

Now if $M = \max_{-1 \le x \le 1} |p_{n+1}(x)|$, it can be shown (see Problem 5) that for all 
$x$ such that $|x| > 1$ 

    $|p_{n+1}(x)| \le M 2^n |T_{n+1}(x)|.$ 

The equality can only hold if $p_{n+1}(x) \equiv T_{n+1}(x)$. 

3.2. Centered Interpolation Formulae 

Consider the error (13) in the interpolation polynomial for equal spacing. 
This error may be estimated if $f^{(n+1)}(\xi)$ does not vary “too much” in the 
interval $\min(x_0, x) < \xi < \max(x_n, x)$. [An idea of this variation may be 
obtained, as a result of (7), by examining the differences $\Delta^{n+1} f$.] If the 
variation is not too large, then as an estimate of the error we may use 

    $R_n(x) = R_n(x_0 + th) \cong \frac{\pi_n(t)}{(n + 1)!} \Delta^{n+1} f(x_0).$ 

An approximate bound on the error is obtained if $\pi_n(t)$ is replaced by its 
maximum absolute value in the interval in question. 

Since, in general, we do not know $f^{(n+1)}(\xi)$, the best that can be done in 
order to obtain the smallest possible error is to use the interpolation polynomials 
only for that range of $t$ where $\pi_n(t)$ has its least absolute value, 
i.e., by Lemma 2, for $t$ near $n/2$ or equivalently for $x$ near the midpoint 
of $[x_0, x_n]$. So if there are tabular points equally distributed about the 
interval of interpolation, then the interpolation polynomial to be employed 
should use tabular points centered as nearly as possible about the interval 
of interpolation. It is clear (see Figure 1) that when $x$ is outside the interval 
$[x_0, x_n]$, or near the endpoints, $|\pi_n(t)|$ may be relatively large and, if so, 
may cause the extrapolation or interpolation error to be relatively large. 

It is rather inconvenient, in general, to locate in the Difference Table 
1 those differences which must be employed in (12) when $t$ is at $n/2$. 
For this purpose we derive special formulae which simplify the task. 
Let us assume that the interval in which the interpolation is to be done is 
$x_0 < x < x_1$, and that we have arbitrarily many tabular points $x_j$, $j = 0, 
\pm 1, \pm 2, \ldots$, about this interval. Then the ordinary Newton interpolation 
polynomial, using successively the points $x_0, x_1, x_{-1}, x_2, x_{-2}, \ldots$, will have 
the desired features with regard to the interval $[x_0, x_1]$. This polynomial 
is of the form 

(14)   $Q_n(x) = f_0 + (x - x_0) f_{0,1} + (x - x_0)(x - x_1) f_{0,1,-1} 
       + (x - x_0)(x - x_1)(x - x_{-1}) f_{0,1,-1,2} + \cdots,$ 

the form of the final term depending upon the oddness or evenness of $n$. 

However, since the divided differences are symmetric functions of their 
arguments we may write 

    $f_{0,1,-1} = f_{-1,0,1};$ 

    $f_{0,1,-1,2} = f_{-1,0,1,2};$ 

    $f_{0,1,-1,\ldots,m,-m} = f_{-m,\ldots,-1,0,1,\ldots,m}.$ 

Then using (4) with the appropriate shifts in the subscripts, 

    $f_{-1,0,1} = \frac{1}{2! \, h^2} \Delta^2 f_{-1},$ 

    $f_{-1,0,1,2} = \frac{1}{3! \, h^3} \Delta^3 f_{-1},$ 

    $f_{-m,\ldots,-1,0,1,\ldots,m} = \frac{1}{(2m)! \, h^{2m}} \Delta^{2m} f_{-m},$ 

    $f_{-m,\ldots,-1,0,1,\ldots,m,m+1} = \frac{1}{(2m + 1)! \, h^{2m+1}} \Delta^{2m+1} f_{-m}.$ 
The interpolation polynomial (14) can now be written as, for even $n = 2m$: 

(15a)  $Q_n(x_0 + th) = f_0 + t \Delta f_0 + \frac{t(t - 1)}{2!} \Delta^2 f_{-1} + \frac{t(t - 1)(t + 1)}{3!} \Delta^3 f_{-1} + \cdots 
       + \frac{t(t - 1)(t + 1) \cdots (t - m)}{(2m)!} \Delta^{2m} f_{-m};$ 

and for odd $n = 2m + 1$: 

(15b)  $Q_n(x_0 + th) = f_0 + t \Delta f_0 + \cdots 
       + \frac{t(t - 1)(t + 1) \cdots (t - m)(t + m)}{(2m + 1)!} \Delta^{2m+1} f_{-m}.$ 

This is the Gaussian (forward) form for the interpolation polynomials. 
The differences used in forming these polynomials are on the line containing 
$x_0$ and the line between $x_0$ and $x_1$ (see Table 3). 


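Table 3    Differences 

$x$         $f(x)$      $\Delta$            $\Delta^2$           ...    $\Delta^{2m}$            $\Delta^{2m+1}$ 

$x_{-m}$    $f_{-m}$ 
                        $\Delta f_{-m}$ 
   .           . 
$x_{-1}$    $f_{-1}$ 
                        $\Delta f_{-1}$ 
$x_0$       $f_0$                           $\Delta^2 f_{-1}$    ...    $\Delta^{2m} f_{-m}$ 
                        $\Delta f_0$                                                             $\Delta^{2m+1} f_{-m}$ 
$x_1$       $f_1$ 
   .           . 
                        $\Delta f_{m-1}$ 
$x_m$       $f_m$ 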




A more symmetric form of the interpolation polynomial can be obtained 
when $n = 2m + 1$ (i.e., of odd degree). For this purpose we must introduce 
the centered difference notation, for even order differences, 

(16a)  $\delta^2 f_r = \Delta f_r - \Delta f_{r-1} = \Delta^2 f_{r-1};$ 

       $\delta^{2k} f_r = \delta^2 (\delta^{2(k-1)} f)_r = \Delta^{2k} f_{r-k}, \qquad k = 2, 3, \ldots.$ 

The point $x_r$ is always the midpoint about which these differences are 
centered. With this notation, all odd order ordinary differences, higher 
than the first, can be written as the difference of two centered differences: 

(16b)  $\Delta^{2m+1} f_{-m} = \delta^{2m} f_1 - \delta^{2m} f_0; \qquad m = 1, 2, \ldots.$ 

Using (16) in (15b) yields: 

    $Q_n(x_0 + th) = f_0 + t \Delta f_0 + \frac{t(t - 1)}{2!} \Delta^2 f_{-1} + \frac{t(t - 1)(t + 1)}{3!} \Delta^3 f_{-1} + \cdots$ 
    $\qquad + \frac{t(t - 1)(t + 1) \cdots (t - m)}{(2m)!} \Delta^{2m} f_{-m} + \frac{t(t - 1)(t + 1) \cdots (t - m)(t + m)}{(2m + 1)!} \Delta^{2m+1} f_{-m}$ 

    $= f_0 + t(f_1 - f_0) + \frac{t(t - 1)}{2!} \delta^2 f_0 + \frac{t(t - 1)(t + 1)}{3!} (\delta^2 f_1 - \delta^2 f_0) + \cdots$ 
    $\qquad + \frac{t(t - 1)(t + 1) \cdots (t - m)}{(2m)!} \delta^{2m} f_0 + \frac{t(t - 1)(t + 1) \cdots (t - m)(t + m)}{(2m + 1)!} (\delta^{2m} f_1 - \delta^{2m} f_0)$ 

    $= t f_1 + \frac{t(t - 1)(t + 1)}{3!} \delta^2 f_1 + \cdots + \frac{t(t - 1)(t + 1) \cdots (t - m)(t + m)}{(2m + 1)!} \delta^{2m} f_1$ 
    $\qquad + (1 - t) f_0 + \frac{t(t - 1)}{3!} (2 - t) \delta^2 f_0 + \cdots + \frac{t(t - 1)(t + 1) \cdots (t - m)}{(2m + 1)!} (m + 1 - t) \delta^{2m} f_0.$ 

By introducing $s = 1 - t$ we may simplify the coefficients of $\delta^{2k} f_0$ and 
the above finally takes on the symmetric form: 

(17)   $Q_n(x_0 + th) = s f_0 + \frac{s(s^2 - 1^2)}{3!} \delta^2 f_0 + \cdots + \frac{s(s^2 - 1^2) \cdots (s^2 - m^2)}{(2m + 1)!} \delta^{2m} f_0$ 
       $\qquad + t f_1 + \frac{t(t^2 - 1^2)}{3!} \delta^2 f_1 + \cdots + \frac{t(t^2 - 1^2) \cdots (t^2 - m^2)}{(2m + 1)!} \delta^{2m} f_1.$ 

This is known as Everett's form of the interpolation polynomial. 
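
Everett's form (17) may be sketched in Python as follows. The function name `everett`, the indexing convention (`i0` is the list position of $x_0$), and the sample data are assumptions made for this illustration; the even-order centered differences are formed from the ordinary ones by (16a).

```python
import math

def everett(f, i0, t, m):
    """Evaluate (17) at x_0 + t*h. f holds equally spaced values, f[i0] = f(x_0);
    the values f[i0 - m], ..., f[i0 + 1 + m] must all be available."""
    def centered(i, k):
        # δ^{2k} f_i = Δ^{2k} f_{i-k}, by (16a)
        col = list(f)
        for _ in range(2 * k):
            col = [b - a for a, b in zip(col, col[1:])]
        return col[i - k]

    s = 1.0 - t
    cs, ct = s, t      # s(s²-1²)...(s²-k²)/(2k+1)! and the same in t, for k = 0
    total = 0.0
    for k in range(m + 1):
        total += cs * centered(i0, k) + ct * centered(i0 + 1, k)
        j = k + 1
        cs *= (s * s - j * j) / ((2 * j) * (2 * j + 1))
        ct *= (t * t - j * j) / ((2 * j) * (2 * j + 1))
    return total

# sin(1.65) with x_0 = 1.6, x_1 = 1.7, h = 0.1, t = 0.5, m = 1
fs = [0.9974949866, 0.9995736030, 0.9916648105, 0.9738476309]
approx = everett(fs, 1, 0.5, 1)
print(approx, math.sin(1.65))
```

With $m = 1$ the formula uses only the function values and second differences at $x_0$ and $x_1$, yet reproduces the cubic through $x_{-1}, x_0, x_1, x_2$.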


3.3. Practical Observations on Interpolation 

In this subsection, we gather and comment on some of the “rules of 
thumb” which are used by the practitioners of interpolation. 

(a) A convenient “rule” to determine approximately the magnitude of 
the error in linear interpolation is 

    $|f(x_0 + th) - t f(x_0 + h) - (1 - t) f(x_0)| \le \frac{|\Delta^2 f(x_0)|}{8}.$ 

The factor $\frac{1}{8}$ is an upper bound for $|t(t - 1)/2|$, if $0 \le t \le 1$, while by (7) 
$h^2 f''(\xi) \cong \Delta^2 f(x_0)$ in the remainder term (13). 

(b) A “rule” for estimating the magnitude of the error in general 
polynomial interpolation is to use the magnitude of the first neglected 
term in the Newton form. That is, the error in using (12), (15a), or (15b) 
is approximately the next term in the series. By (13) this estimate is seen 
to be good if the ratio $f^{(n+1)}(\xi) / f^{(n+1)}(\eta)$ is near to 1 for $\xi$ and $\eta$ in an 
interval containing $x$ and all the indicated $x_i$. 

(c) In a table of differences, we may compute the “average value,” $A_p$, 
of a column of $p$th order differences from the definition: 

    $A_p = \frac{1}{n - p + 1} \sum_{k=0}^{n-p} \Delta^p f_k.$ 

It is easy to show (see Problem 7), that if an isolated error is made in $f_k$, 
for some $k$ satisfying $p \le k \le n - p$, then $A_p$ is unaffected. This observation 
provides a simple way of locating $k$ and estimating the $p + 1$ errors 
that arise in the column of $p$th order differences from an isolated error in 
$f_k$; and hence yields the error in $f_k$ approximately. 

A table user could difference a printed table (whose accuracy has not 
been established) in order to weed out isolated typographical errors and 
to decide upon the order of interpolation that may be necessary. 

(d) In the construction of a mathematical table, one tries to present a 
listing that provides a reasonable number of decimal places (or significant 
figures) and also permits a simple interpolation process to attain almost 
the full accuracy of the table, without its being too voluminous. To this 
end, some table makers list not only $f(x_i)$ but $\delta_m^2 f(x_i)$, where $\delta_m^2 f(x_i)$ is 
called a modified second difference (only the significant figures of $\delta_m^2$ are 
listed). The modification is based on the use of the Everett form of the 
interpolation formula (17) with $m = 2$: 

    $f(x_0 + th) \cong t f(x_1) + \frac{t(t^2 - 1)}{3!} \left[ \delta^2 f(x_1) + \frac{t^2 - 4}{20} \delta^4 f(x_1) \right]$ 
    $\qquad + s f(x_0) + \frac{s(s^2 - 1)}{3!} \left[ \delta^2 f(x_0) + \frac{s^2 - 4}{20} \delta^4 f(x_0) \right],$ 

where $s = 1 - t$. In order to incorporate most of the “effect” of the fourth 
difference into the second difference, one uses an “average value” for the 
coefficient $(t^2 - 4)/20$. Very simply, since 

    $\int_0^1 \frac{p^2 - 4}{20} \, dp = -\frac{11}{60} = -0.18333,$ 

we may define the modified second difference by 

    $\delta_m^2 f(x) = \delta^2 f(x) - \frac{11}{60} \delta^4 f(x),$ 

and then use, for interpolation in the table, 

    $f(x_0 + th) \cong t f(x_1) + \frac{t(t^2 - 1)}{6} \delta_m^2 f(x_1) + s f(x_0) + \frac{s(s^2 - 1)}{6} \delta_m^2 f(x_0).$ 

Other, more sophisticated arguments have been given to justify the use 
of other “average values” for the coefficient $(t^2 - 4)/20$, e.g., $-0.18393$ 
(see Problem 6 for justification). 

3.4. Divergence of Sequences of Interpolation Polynomials 

It is not generally true that higher degree interpolation polynomials 
yield more accurate approximations. In fact, for equidistant points of 
interpolation one should use polynomials of relatively low order. We shall 
illustrate this by examining the interpolation error, $R_n(x)$, as a function 
of $n$ and $x$ for a particular function. 

Specifically we take a function considered by Runge: 

(18a)  $f(x) = \frac{1}{1 + x^2},$ 

and consider, in $[-5, 5]$, the equally spaced points 

(18b)  $x_j = -5 + j \Delta x, \qquad j = 0, 1, 2, \ldots, n, \qquad \Delta x = \frac{10}{n}.$ 

For each $n$ there is a unique polynomial $P_n(x)$ of degree at most $n$ such that 
$P_n(x_j) = f(x_j)$. This is the interpolation polynomial for (18a) using the 
points (18b). We shall show that $|f(x) - P_n(x)|$ will become arbitrarily 
large at points in $[-5, 5]$ if $n$ is sufficiently large. This occurs even though 
the interpolation points $\{x_j\}$ become dense in $[-5, 5]$ as $n \to \infty$. 

The remainder in interpolation is, by (1.8) 

(19)   $R_n(x) = f(x) - P_n(x) = \prod_{j=0}^{n} (x - x_j) \, f[x_0, \ldots, x_n, x].$ 

However, with the function and points in (18) we claim that 

(20)   $f[x_0, \ldots, x_n, x] = f(x) \cdot \frac{(-1)^{r+1}}{\prod_{j=0}^{r} (1 + x_j^2)} \cdot \begin{cases} 1, & \text{if } n = 2r + 1; \\ x, & \text{if } n = 2r. \end{cases}$ 

We first prove this for the odd case, $n = 2r + 1$, by induction on $r$. Note 
that in this case there are an even number of interpolation points in (18b) 
and they satisfy $x_j = -x_{n-j}$. For $r = 0$ we have $n = 1$ and $x_0 = -x_1$. 
Then a direct calculation from the divided difference representation (1.4) 
using (18a) yields 

    $f[x_0, x_1, x] = -\frac{1}{(1 + x^2)(1 + x_0^2)} = f(x) \cdot \frac{-1}{1 + x_0^2},$ 

and the first step of the induction is established. Now assume (20) to be 
valid for $n = 2r + 1$, with any $x_0, \ldots, x_n$ that are pairwise symmetric 
(i.e., $x_j = -x_{n-j}$), and let $m = n + 2 = 2(r + 1) + 1$. We define the 
function $g(x)$ by 

    $g(x) = f[x_1, x_2, \ldots, x_{m-1}, x],$ 

where now $x_j = -x_{m-j}$, and use (1.6) to write 

    $f[x_0, x_1, \ldots, x_m, x] = g[x_0, x_m, x].$ 

However, by the inductive hypothesis it follows that 

    $g(x) = f(x) \cdot A_r, \qquad A_r = \frac{(-1)^{r+1}}{\prod_{j=1}^{r+1} (1 + x_j^2)}.$ 

Also from (18b) we note that $x_0 = -x_m$ and hence by the previous calculation 

    $g[x_0, x_m, x] = f(x) A_r \cdot \frac{-1}{1 + x_0^2},$ 

which upon substitution for $A_r$ concludes the induction. 

The verification of (20) for $n = 2r$ is similar to the above and is left to 
the reader. Another proof of (20) is given as Problem 9. 

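Since $(x - x_j)(x - x_{n-j}) = x^2 - x_j^2$ by (18b), and recalling that
$x_r = 0$ for $n = 2r$, we have

(21)  $\prod_{j=0}^{n} (x - x_j) = \prod_{j=0}^{r} (x^2 - x_j^2) \cdot \begin{cases} 1, & \text{if } n = 2r + 1; \\ 1/x, & \text{if } n = 2r. \end{cases}$

From (21) and (20) in (19) the error can be written as

(22)  $R_n(x) = (-1)^{r+1} f(x)\, g_n(x), \qquad g_n(x) \equiv \prod_{j=0}^{r} \dfrac{x^2 - x_j^2}{1 + x_j^2}.$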


Note that $\tfrac{1}{26} \le f(x) \le 1$ for $x$ in $[-5, 5]$ and so the convergence or
divergence properties are determined by $g_n(x)$. Further, since $g_n(x) =
g_n(-x)$, we need only consider the interval $[0, 5]$. [It is also of interest to
note that $R_n(x)$ is, in fact, an even function of $x$ for all $n$.]





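To examine $|g_n(x)|$ for large $n$, or equivalently for large $r$, we write

(23a)  $|g_n(x)| = \left[ e^{\Delta x \ln |g_n(x)|} \right]^{1/\Delta x},$

where from the definition in (22)

(23b)  $\Delta x \ln |g_n(x)| = \sum_{j=0}^{r} \Delta x \ln \left| \dfrac{x^2 - x_j^2}{1 + x_j^2} \right|.$

In Problem 8 we show that for appropriate $x \in [1, 5]$ and for all $x_j$:

(24)  $|x \pm x_j| > c\,|\Delta x|^m.$

For these values of $x$ the sum in (23b) converges uniformly as $n \to \infty$;
that is, explicitly,

(25)  $\lim_{n \to \infty} \Delta x \ln |g_n(x)| = \lim_{r \to \infty} \sum_{j=0}^{r} \Delta x \ln \left| \dfrac{x^2 - x_j^2}{1 + x_j^2} \right| = \int_{-5}^{0} \ln \left| \dfrac{x^2 - \xi^2}{1 + \xi^2} \right| d\xi \equiv q(x).$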

To demonstrate this convergence, we note that

    $\ln \left| \dfrac{x^2 - x_j^2}{1 + x_j^2} \right| = \ln |x + x_j| + \ln |x - x_j| - \ln |1 + x_j^2|,$

and similarly, with $x_j$ replaced by $\xi$. Next we show that each of the
three sums converges to the corresponding integrals. Those sums corresponding
to the last two terms converge to their corresponding integrals
by the definition of Riemann integrals since the corresponding integrands
are continuous functions of $\xi$ [recall that $x > 1$ by (24) and $x_j \le 0$ for
$j \le r$ by (18b)]. Hence, we need only show that

(26)  $\lim_{r \to \infty} \sum_{j=0}^{r} \ln |x + x_j|\, \Delta x = \int_{-5}^{0} \ln |x + \xi|\, d\xi,$

provided $x$ satisfies (24). Let $\delta < 1$ be an arbitrarily small fixed positive
number. Then

(27a)  $\lim_{r \to \infty} \sum_{|x + x_j| \ge \delta} \ln |x + x_j|\, \Delta x = \left( \int_{-5}^{-x-\delta} + \int_{-x+\delta}^{0} \right) \ln |x + \xi|\, d\xi,$

since $\ln |x + \xi|$ is a continuous function of $\xi$ over the indicated intervals
of integration. The missing part of the integral in (26) is

(27b)  $\int_{-x-\delta}^{-x+\delta} \ln |x + \xi|\, d\xi = 2 \int_{0}^{\delta} \ln \eta\, d\eta = 2(\delta \ln \delta - \delta).$

The remaining part of the sum in (26) can be bounded if we take $r$ so
large that $\Delta x < \delta$. Then recalling (24) we have

(27c)  $\left| \sum_{|x + x_j| < \delta} \ln |x + x_j|\, \Delta x \right| \le 2 \Delta x\, \left| \ln c (\Delta x)^m \right| + \left| \sum_{\Delta x < |x + x_j| < \delta} \ln |x + x_j|\, \Delta x \right|$

       $\le 2 \Delta x\, \left| \ln c (\Delta x)^m \right| + 2 \left| \int_{0}^{\delta} \ln \eta\, d\eta \right|$

       $\le 2 \left( \Delta x\, |m \ln \Delta x + \ln c| + \delta\, |\ln \delta - 1| \right)$

       $= O(\delta |\ln \delta|).$

The first term on the right in the first line is obtained from the two terms,
say $x_k$ and $x_{k+1}$, which are nearest to $-x$. The remaining sum has been
bounded by the integral by means of the monotonicity of the function
$\ln x$. That is, since $\delta < 1$, we use

    $0 > \Delta x \ln |x + x_j| > \int_{|x + x_{j-1}|}^{|x + x_j|} \ln \eta\, d\eta,$

if $|x + x_{j-1}| < |x + x_j|$ (otherwise the limits of integration are $|x + x_j|$
and $|x + x_{j+1}|$). Letting $r \to \infty$ in (27c) and using (27a) and (27b) we get (26)
since $\delta$ is arbitrarily small. This concludes the proof of (25).

In Problem 10 we indicate how $q(x)$ can be explicitly evaluated and it is
required to show that

(28a)  $q(x) = 0$ at $x = 3.63\ldots$;

(28b)  $q(x) < 0$ for $|x| < 3.63\ldots$;

(28c)  $q(x) > 0$ for $3.63\ldots < |x| \le 5$.

Now let $x$ satisfy (24) and $x > 3.63\ldots$. Then by (25) and (28c)
in (23a) we have, recalling that $\Delta x = 10/n$,

    $\lim_{n \to \infty} |g_n(x)| = \infty.$

That is, from (22), $|R_n(x)| \to \infty$ as $n \to \infty$ for $x$ as above. Also, since
$R_n(x) = R_n(-x)$ the points of divergence are symmetrically located on the
axis.
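
The threshold $3.63\ldots$ can be located numerically from the closed form quoted in the hint to Problem 10, $q(x) = (5 + x)\ln(5 + x) + (5 - x)\ln(5 - x) - 5 \ln 26 - 2 \arctan 5$ for $0 < x < 5$. The Python sketch below is illustrative only: the bisection routine and its tolerance are assumptions made for this example, not part of the text.

```python
import math

def q(x):
    """Closed form of q(x) for 0 <= x <= 5, from the hint to Problem 10."""
    a, b = 5.0 + x, 5.0 - x
    # (5 - x) ln(5 - x) tends to 0 as x -> 5, so treat b = 0 separately.
    tail = b * math.log(b) if b > 0.0 else 0.0
    return a * math.log(a) + tail - 5.0 * math.log(26.0) - 2.0 * math.atan(5.0)

def bisect(f, lo, hi, tol=1e-12):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    flo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if (fmid < 0.0) == (flo < 0.0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)

root = bisect(q, 0.0, 5.0)
print(root)
```

Since (23a) and (25) give $|g_n(x)| \approx e^{q(x)/\Delta x} = e^{n q(x)/10}$ for large $n$, the sign of $q(x)$ decides between exponential decay and exponential growth of the error factor.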

This example illustrates part of the general convergence theory for
sequences of interpolation polynomials based on uniformly spaced points
in an interval $[a, b]$. According to this theory, if $f(z)$ is “analytic” in a
domain of the complex plane containing $[a, b]$, then the sequence of
interpolation polynomials for $f(z)$ converges inside the largest “lemniscate”
whose interior is in the domain of analyticity of $f(z)$. The “lemniscate”
passing through $z = \pm 3.63\ldots$ also passes through the points $z = \pm\sqrt{-1}$
at which $f(z) = 1/(1 + z^2)$ is singular.

The “lemniscates” associated with an interval $[a, b]$ are simple closed
curves which are analogous to the circles that characterize the domain of
convergence of a power series expansion about a given point. That is,
the sequence

    $S_n(z) = \sum_{k=0}^{n} \dfrac{f^{(k)}(a)}{k!} (z - a)^k$

converges to the function $f(z)$ for all $z$ inside the largest circle $|z - a| = r$
about $a$ in which $f(z)$ is analytic. For $f(z) = 1/(1 + z^2)$ we obtain the
sequence

    $S_{2n}(z) = \sum_{k=0}^{n} (-1)^k z^{2k},$

which converges for $|z| < 1$. This is the largest circle about the origin not
containing the singular points $z = \pm\sqrt{-1}$.


PROBLEMS, SECTION 3 

1. Without using $\Delta^n f(x) = h^n f^{(n)}(\xi)$, derive the value of $\Delta^n P_n(x)$ where
$P_n(x) = a_0 + a_1 x + \cdots + a_n x^n$.

2. Find the errors in the following table of function values taken at evenly 
spaced arguments: 50173, 53503, 56837, 60197, 63522, 66871, 70226, 73566, 
76950, 80320, 83695, 87084, 90459, 93849, 97244, 100634, 104049, 107460. 
In this example it suffices to examine the column of second differences. 

[Hint: See Problem 7.] 

3. Compare error factors, $\prod_{j=0}^{n} (x - x_j)$, in equally spaced and Chebyshev
interpolation over $[a, b]$. (That is, use Stirling's approximation to $n!$ and
verify the estimates of $M_{\mathrm{Ch}}/M_{\mathrm{Eq}}$ and $M_{\mathrm{Ch}}/M_{\mathrm{Eq}}^{*}$, following Lemma 2.)

4. Derive the result that $\Delta \pi_n(t) = (n + 1)\pi_{n-1}(t)$, $n = 1, 2, \ldots$; where the
spacing in $\Delta$ is $h = 1$.

5.* Prove the following

THEOREM. If $p_n(x)$ is a polynomial of degree at most $n$ and $\max_{|x| \le 1} |p_n(x)| = M$
then for all $x$ in $|x| > 1$,

    $|p_n(x)| \le M\, 2^{n-1} |T_n(x)| = M\, |\cos (n \cos^{-1} x)|.$

[Hint: For a proof by contradiction, consider the polynomial

    $q_n(x) = \dfrac{p_n(\xi)}{T_n(\xi)}\, T_n(x) - p_n(x),$

with $\xi$ a point where the conclusion is invalid; show $q_n(x)$ has $n + 1$ zeros.]
6.* The technique of L. J. Comrie for modifying the second difference uses
the idea of selecting a constant, $-c$, to replace $(t^2 - 4)/20$ and $(s^2 - 4)/20$
in the Everett formula so that the maximum error in the resulting interpolation
formula is a minimum. Supply the missing details in the following sketch:

Let the Everett form of the interpolation polynomial be

    $P_5(x_0 + ph) = q u_0 + p u_1 + e(q)\,\delta^2 u_0 + e(p)\,\delta^2 u_1 + d(q)\,\delta^4 u_0 + d(p)\,\delta^4 u_1,$

where

    $q = 1 - p, \qquad e(p) = \dfrac{p(p^2 - 1)}{6}, \qquad d(p) = \dfrac{p(p^2 - 1)(p^2 - 4)}{120}.$

Then

    $P_5(x_0 + ph) = q u_0 + p u_1 + e(p)(\delta^2 u_1 - c\,\delta^4 u_1) + e(q)(\delta^2 u_0 - c\,\delta^4 u_0) + R,$

with

    $R = [d(p) + c\,e(p)]\,\delta^4 u_1 + [d(q) + c\,e(q)]\,\delta^4 u_0.$

If we try to pick $c$ so as to minimize $\max_{0 \le p \le 1} |R|$, we see that $c$ must depend
on $u$.

Hence, we simplify the problem by noting that $\delta^4 u_1 = \delta^4 u_0 + \Delta \delta^4 u_0$.
Now if $\Delta \delta^4 u_0$ is much smaller than $\delta^4 u_0$ we may neglect the fifth difference and
minimize

    $\max_{0 \le p \le 1} \left| d(p) + d(q) + c\,[e(p) + e(q)] \right| = \max_{0 \le p \le 1} \left| \dfrac{p(p^2 - 1)(p - 2)}{24} + c\, \dfrac{p(p - 1)}{2} \right|.$

If we let the polynomial inside the absolute value sign be $g(p, c)$, then the
maximum occurs when

    $\dfrac{\partial g}{\partial p} = \tfrac{1}{6}\left(p - \tfrac{1}{2}\right)(p^2 - p - 1 + 6c) = 0,$

or $p = \tfrac{1}{2}$, $p = \tfrac{1}{2} \pm \sqrt{\tfrac{5}{4} - 6c}$. Since $p(p - 1)$ is of one sign, $|g(p, c)|$ should have
equal values at its maxima, in order that they be minimum. Set

    $\left| g\left(\tfrac{1}{2}\right) \right| = \left| g\left(\tfrac{1}{2} \pm \sqrt{\tfrac{5}{4} - 6c}\right) \right|,$

which yields

    $\dfrac{3 - 16c}{128} = \dfrac{(1 - 6c)^2}{24}.$

Only the larger one of the roots, $c$, is appropriate: $c \approx 0.18393$.

7. Given a table of $f(x)$ for $x_j = x_0 + jh$, $0 \le j \le n$. Show that if $f_k$ is
replaced by $f_k + \delta$, for any $k$ with $p \le k \le n - p$, then

    $\sum_{j=0}^{n-q} \Delta^q f_j / (n - q + 1)$

is unaffected, for $1 \le q \le p$.

8. For $x$ a fixed positive irrational algebraic number of degree $m$, Liouville's
theorem states that there exists a constant $K(x)$ such that, for all positive
integers $(p, q)$,

    $|x - p/q| > K q^{-m}.$

Show that this implies (24) for some constant $c(x)$ and all $x_j$.
9. Verify that if $f(x) = 1/(x + c)$,

    $f[x_0, x_1, \ldots, x_n] = (-1)^n \dfrac{1}{\prod_{i=0}^{n} (x_i + c)}.$

Hence, establish (20) by writing

    $\dfrac{1}{1 + x^2} = \dfrac{1}{2i} \left( \dfrac{1}{x - i} - \dfrac{1}{x + i} \right),$

where $i^2 = -1$.

10. Verify for the function

    $q(x) \equiv \int_{0}^{5} \ln \left| \dfrac{x^2 - \xi^2}{1 + \xi^2} \right| d\xi$

that

(a) $q(x) = 0$ at $x \approx 3.63\ldots$;

(b) $q(x) < 0$ for $|x| < 3.63\ldots$;

(c) $q(x) > 0$ for $3.63\ldots < |x| \le 5$.

[Hint: Derive for $0 < x < 5$,

    $q(x) = \int_{0}^{x} \ln (x - \xi)\, d\xi + \int_{x}^{5} \ln (\xi - x)\, d\xi + \int_{0}^{5} \{ \ln (\xi + x) - \ln (1 + \xi^2) \}\, d\xi$

    $\phantom{q(x)} = (5 + x) \ln (5 + x) + (5 - x) \ln (5 - x) - 5 \ln 26 - 2 \arctan 5.$]


4. CALCULUS OF DIFFERENCE OPERATORS 

When dealing with equally spaced data there is a very useful operator 
method available to suggest new formulae and to aid in recalling the 
fundamental ones. The basic operators are: 

(1a) Identity      $I f(x) = f(x)$;

(1b) Displacement  $E f(x) = f(x + h)$;

(1c) Difference    $\Delta f(x) = f(x + h) - f(x)$;

(1d) Derivative    $D f(x) = \dfrac{df(x)}{dx} = f'(x)$.

Note that the displacement and difference operators imply a fixed spacing,
$h$, by which the argument is to be shifted. We assume that $E$ and $\Delta$ use
the same such value unless otherwise specified. To employ $D$, the function
on which it operates must be differentiable. In fact, the classes of functions
to which all symbolic formulae apply must generally be restricted. We
shall consider only the class of polynomials (but more general extensions
are possible). Two operators, $A$ and $B$, are said to be equal if $A f(x) =
B f(x)$ for every function $f(x)$ of the class under consideration, i.e., for
every polynomial.

From the definitions (1), it is clear that the four operators are linear,
i.e., if $A$ is any one of them then

(2)  $A[\alpha f(x) + \beta g(x)] = \alpha A f(x) + \beta A g(x),$

for arbitrary numbers $\alpha$, $\beta$ and functions $f(x)$, $g(x)$. The product, $AB$,
and the sum, $A + B$, of two operators $A$ and $B$ are defined by

(3a)  $(AB) f(x) \equiv A[B f(x)],$

(3b)  $(A + B) f(x) \equiv A f(x) + B f(x).$

From the definition (3a) the integral powers, $A^n$, of any operator $A$ may
be defined inductively as

(4)  $A^0 \equiv I, \qquad A^n \equiv A A^{n-1}, \quad n = 1, 2, \ldots.$

In addition, we define non-integral powers of the displacement (or shift)
operator, $E^s$, by

(5)  $E^s f(x) \equiv f(x + sh),$

where $s$ is any real number, and observe that $E^s E^r = E^{s+r}$.

Using the definitions (1a), (1b), and (3b) we have

    $E f(x) = f(x + h)$
    $\phantom{E f(x)} = f(x) + [f(x + h) - f(x)]$
    $\phantom{E f(x)} = (I + \Delta) f(x).$

Thus, we conclude, from the definition of equality of operators, that

(6)  $E = I + \Delta.$

Equivalently, we have $\Delta = E - I$ and from the definition of powers of
operators it easily follows that

(7)  $\Delta^n = (E - I)^n = \sum_{k=0}^{n} (-1)^k \binom{n}{k} E^{n-k}.$

This result may be proved by induction, just as is the usual binomial
expansion. However, by applying (7) to $f(x_0)$ we obtain (3.6) which has
previously been derived for general functions. Thus (3.6) yields an
independent proof of (7).

From the extended definition (5) we may write

    $f(x + sh) = E^s f(x).$

On the other hand, the Newton form of the interpolation polynomial
gives

    $f(x + sh) = \left[ I + s\Delta + \dfrac{s(s-1)}{2!} \Delta^2 + \cdots + \dfrac{s(s-1)\cdots(s-p+1)}{p!} \Delta^p + \cdots \right] f(x),$

where we note that the series terminates with $\Delta^p$ if $f(x)$ is a polynomial
of degree $p$.

But the formal binomial series expansion of $(I + \Delta)^s$, for $s$ arbitrary,
is identical with the series on the right-hand side, and hence we adopt it as
the definition for $s$ non-integral. That is, with this convention,

(8)  $E^s = (I + \Delta)^s$ for $s$ arbitrary.

Thus, it is clear that the steps leading to (8) are not a derivation of Newton's
formula but can now be used to recall that formula when required.
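
As a check on the convention (8), the binomial series in $\Delta$ can be applied to a polynomial in exact rational arithmetic. The Python sketch below is an illustration under stated assumptions: the helper names `leading_differences` and `shift` are hypothetical, and the series is truncated at the degree of the polynomial, where it terminates.

```python
from fractions import Fraction

def leading_differences(f, x, h, order):
    """Return [f(x), Δf(x), Δ²f(x), ..., Δ^order f(x)] for spacing h."""
    column = [f(x + j * h) for j in range(order + 1)]
    leading = [column[0]]
    for _ in range(order):
        column = [b - a for a, b in zip(column, column[1:])]
        leading.append(column[0])
    return leading

def shift(f, x, h, s, degree):
    """Evaluate E^s f(x) = f(x + s h) from the binomial series (I + Δ)^s."""
    d = leading_differences(f, x, h, degree)
    total, binom = Fraction(0), Fraction(1)   # binom holds s(s-1)...(s-k+1)/k!
    for k in range(degree + 1):
        total += binom * d[k]
        binom = binom * (s - k) / (k + 1)
    return total

def cubic(x):
    return x**3 - 2 * x + 5

x, h, s = Fraction(1), Fraction(1, 2), Fraction(3, 7)
print(shift(cubic, x, h, s, 3), cubic(x + s * h))
```

The two printed values agree exactly for any rational $s$, including negative values, because the series terminates for polynomials.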

Similarly, such manipulations can be employed to suggest new formulae
which can then be verified independently. For example, we define the
backward difference operator, $\nabla$, by

    $\nabla f(x) \equiv f(x) - f(x - h).$

Then as in deriving (6) we find that

    $E^{-1} = (I - \nabla),$

and proceeding as in (8) we obtain, formally,

(9)  $f(x - sh) = \left[ I - s\nabla + \dfrac{s(s-1)}{2!} \nabla^2 + \cdots + (-1)^{k+1} \dfrac{s(s-1)\cdots(s-k)}{(k+1)!} \nabla^{k+1} + \cdots \right] f(x).$

The formula suggested is, in fact, well known as Newton's backward
difference formula. By introducing centered difference operators,

    $\delta f(x) \equiv f\left(x + \tfrac{h}{2}\right) - f\left(x - \tfrac{h}{2}\right),$

we could derive other formulae.

To relate $D$ to the other operators we write the formal Taylor's series
expansion

    $f(x + h) = f(x) + \dfrac{h}{1!} f'(x) + \dfrac{h^2}{2!} f''(x) + \cdots + \dfrac{h^n}{n!} f^{(n)}(x) + \cdots$

in the symbolic form

    $E f(x) = \left[ I + \dfrac{hD}{1!} + \dfrac{h^2 D^2}{2!} + \cdots + \dfrac{h^n D^n}{n!} + \cdots \right] f(x).$

Since the series in the brackets is the expansion of $e^{hD}$ and the equality
is valid for all polynomials, we have the interesting result:

(10)  $e^{hD} = E = I + \Delta.$

[We recall that this merely says that the operator $E$ and the operator
$I + hD + \cdots + h^n D^n / n!$ are equivalent when applied to a polynomial
of degree $n$, for all positive integers $n$.] Formally by taking logarithms in
equation (10) we find that

(11)  $hD = \ln (I + \Delta) = \Delta - \tfrac{1}{2}\Delta^2 + \tfrac{1}{3}\Delta^3 - \cdots + (-1)^{n+1} \tfrac{1}{n} \Delta^n + \cdots.$

This formula suggests that $hD$ and the first $n$ terms on the right-hand side
might be equivalent when applied to any polynomial of degree $n$. To verify
this we use the Newton formula, (8), for any polynomial $f(x)$ of degree $n$:

    $f(x + sh) = \left[ I + s\Delta + \dfrac{s(s-1)}{2!} \Delta^2 + \cdots + \dfrac{s(s-1)\cdots(s-n+1)}{n!} \Delta^n \right] f(x).$

Differentiate with respect to $s$ and evaluate at $s = 0$ to get

    $h f'(x) = \left[ \Delta - \tfrac{1}{2}\Delta^2 + \tfrac{1}{3}\Delta^3 - \cdots + (-1)^{n-1} \tfrac{1}{n} \Delta^n \right] f(x),$

which was to be shown. The relation (11) may now be employed to obtain
forward difference approximations to the derivative of a tabulated function.
The general problem of approximating the derivatives of a function
is considered in more detail in the next section.
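
The truncated form of (11) can be tested directly on a polynomial. The following Python sketch is illustrative; the function names are hypothetical, and exact rational arithmetic is assumed so that the comparison is free of rounding error.

```python
from fractions import Fraction

def leading_differences(f, x, h, order):
    """Return [f(x), Δf(x), ..., Δ^order f(x)] for spacing h."""
    column = [f(x + j * h) for j in range(order + 1)]
    leading = [column[0]]
    for _ in range(order):
        column = [b - a for a, b in zip(column, column[1:])]
        leading.append(column[0])
    return leading

def derivative_by_differences(f, x, h, n):
    """Apply (1/h)[Δ - Δ²/2 + ... + (-1)^(n+1) Δ^n / n] to f at x."""
    d = leading_differences(f, x, h, n)
    series = sum(Fraction((-1) ** (k + 1), k) * d[k] for k in range(1, n + 1))
    return series / h

def quartic(x):
    return x**4 - 3 * x**2 + x

x, h = Fraction(2, 3), Fraction(1, 4)
exact = 4 * x**3 - 6 * x + 1          # derivative of the quartic at x
print(derivative_by_differences(quartic, x, h, 4) == exact)
```

With fewer than four terms the formula is no longer exact for this quartic, which illustrates why the number of terms must match the degree.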

Symbolic methods may also be employed to determine formulae for the
approximate evaluation of integrals. Thus we define

(12)  $J f(x) \equiv \int_{x}^{x+h} f(\xi)\, d\xi,$

and by using (5) we have, formally,

(13)  $J f(x) = h \int_{0}^{1} E^s f(x)\, ds = h \left[ \int_{0}^{1} E^s\, ds \right] f(x) = h \left[ \int_{0}^{1} e^{s \ln E}\, ds \right] f(x)$

      $\phantom{J f(x)} = h\, \dfrac{E - I}{\ln E}\, f(x) = \dfrac{h \Delta}{\ln (I + \Delta)}\, f(x).$

By using (11) this result implies $J = \Delta / D$ or $JD = DJ = \Delta$ which may be
verified directly. However, if we write

    $\ln (I + \Delta) = \Delta (I - R),$

where by definition

    $R \equiv \tfrac{1}{2}\Delta - \tfrac{1}{3}\Delta^2 + \cdots + (-1)^n \tfrac{1}{n} \Delta^{n-1} + \cdots,$

then (13) becomes, symbolically,

(14)  $J = \dfrac{h}{I - R} = h (I + R + R^2 + \cdots).$

Again this might be interpreted as meaning that when applied to any
polynomial $f(x)$ of degree $n$, $J$ is equivalent to the first $n + 1$ terms on the
right, or more simply just those terms involving $\Delta^k$ for $k \le n$. Usually,
(14) is written in powers of $\Delta$, i.e.,

(15)  $J = h \left( I + \tfrac{1}{2}\Delta - \tfrac{1}{12}\Delta^2 + \tfrac{1}{24}\Delta^3 - \tfrac{19}{720}\Delta^4 + \cdots \right).$

To justify (14) we note that $J x^n = \Delta x^{n+1} / (n + 1)$. Now (11) and the
definition of $R$ give

    $hD x^{n+1} = h(n + 1) x^n$
    $\phantom{hD x^{n+1}} = \Delta (I - R) x^{n+1}$
    $\phantom{hD x^{n+1}} = \Delta x^{n+1} - R \Delta x^{n+1}.$

Thus, $\Delta x^{n+1} = h(n + 1) x^n + R \Delta x^{n+1}$ and iterating this result yields,
since $R^{n+1} \Delta x^{n+1} = 0$,

    $\Delta x^{n+1} = h(n + 1) x^n + R [h(n + 1) x^n + R \Delta x^{n+1}]$
    $\phantom{\Delta x^{n+1}} = h(n + 1)(I + R + R^2 + \cdots + R^n) x^n.$

Thus, we have shown that

(16)  $J x^n = h (I + R + \cdots + R^n) x^n.$

Since this holds for all integers $n$, the validity of (14) when applied to
polynomials follows. Different expressions for $J$ can be obtained by using
other identities to eliminate $E$ in the derivation of (13). However, Chapter 7
is devoted to the detailed study of approximation methods for the
evaluation of integrals.
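
The coefficients in (15) follow mechanically from (14): if $C = I + R + R^2 + \cdots$ then $(I - R) C = I$, which gives a recurrence for the coefficients of $C$ in powers of $\Delta$. The Python sketch below is an illustration, not the book's procedure; the function names are hypothetical, and the recurrence is an assumption derived from that identity.

```python
from fractions import Fraction

def gregory_coefficients(n):
    """Coefficients c_0..c_n of I + R + R^2 + ... written in powers of Δ."""
    # R = Δ/2 - Δ²/3 + Δ³/4 - ..., so r[j] is the coefficient of Δ^j in R.
    r = [Fraction(0)] + [Fraction((-1) ** (j + 1), j + 1) for j in range(1, n + 1)]
    c = [Fraction(1)]
    for k in range(1, n + 1):
        # (I - R) C = I gives c_k = sum over j = 1..k of r_j * c_(k-j).
        c.append(sum(r[j] * c[k - j] for j in range(1, k + 1)))
    return c

def leading_differences(f, x, h, order):
    """Return [f(x), Δf(x), ..., Δ^order f(x)] for spacing h."""
    column = [f(x + j * h) for j in range(order + 1)]
    leading = [column[0]]
    for _ in range(order):
        column = [b - a for a, b in zip(column, column[1:])]
        leading.append(column[0])
    return leading

def integrate_one_step(f, x, h, n):
    """Approximate the integral of f over [x, x + h] by h * sum c_k Δ^k f(x)."""
    c = gregory_coefficients(n)
    d = leading_differences(f, x, h, n)
    return h * sum(ck * dk for ck, dk in zip(c, d))

print(gregory_coefficients(4))
print(integrate_one_step(lambda t: t**4, Fraction(0), Fraction(1), 4))
```

The first five coefficients reproduce those displayed in (15), and the one-step integral of $x^4$ over $[0, 1]$ comes out as exactly $1/5$.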

The symbols $1/D$ and $1/\Delta$ are not operators in the same sense as $D$,
$\Delta$, $E$, etc., since

(17)  $\dfrac{1}{D} f(x) \equiv \{F(x)\} \equiv$ the set of polynomials $F(x)$ such that $F'(x) = f(x)$;

(18)  $\dfrac{1}{\Delta} f(x) \equiv \{G(x)\} \equiv$ the set of polynomials $G(x)$ such that $G(x + h) - G(x) = f(x)$.

Nevertheless, if $f(x)$ is a polynomial of degree $n$, $\{F(x)\}$ and $\{G(x)\}$ have
the same structure, that is,

    $\{F(x)\} = \{P_{n+1}(x) + c\}$, $c$ any constant, $P_{n+1}(x)$ a fixed polynomial of degree $n + 1$;

    $\{G(x)\} = \{Q_{n+1}(x) + d\}$, $d$ any constant, $Q_{n+1}(x)$ a fixed polynomial of degree $n + 1$.

Hence,

    $(E^p - E^q) \dfrac{1}{D}$  and  $(E^p - E^q) \dfrac{1}{\Delta}$

are well defined operators. We leave as Problems 2 and 3 the proof that
the corresponding formal power series in $\Delta$ respectively satisfy (19) and
(20):

(19)  $(E^p - E^q) \dfrac{1}{D} f(x) = h\, \dfrac{(I + \Delta)^p - (I + \Delta)^q}{\ln (I + \Delta)}\, f(x) = \int_{x+qh}^{x+ph} f(\xi)\, d\xi,$

where $f(x)$ is a polynomial;

(20)  $(E^p - E^q) \dfrac{1}{\Delta} f(x) = [(I + \Delta)^p - (I + \Delta)^q] \dfrac{1}{\Delta} f(x) = \sum_{j=q}^{p-1} f(x + jh),$

if $p > q$ are integers and $f(x)$ is a polynomial.
Equations (19) and (20) permit the development of formulae for integration
and summation of polynomials [see Problems 4 and 5]. Another
kind of representation of (20), with $q = 0$, arises from replacing the term
$1/\Delta$ on the far left by the formal power series in $D$ by setting, from (10),

(21)  $\dfrac{1}{\Delta} = \dfrac{1}{e^{hD} - I} = \dfrac{1}{hD + \dfrac{h^2 D^2}{2!} + \dfrac{h^3 D^3}{3!} + \cdots}$

      $\phantom{\dfrac{1}{\Delta}} = \dfrac{1}{hD} - \dfrac{1}{2} + \dfrac{1}{12} hD - \dfrac{1}{720} h^3 D^3 + \dfrac{1}{30{,}240} h^5 D^5 - \cdots.$

We obtain the formula,

(22)  $\sum_{j=0}^{p-1} f(x + jh) = \dfrac{1}{h} \int_{x}^{x+ph} f(\xi)\, d\xi - \dfrac{1}{2} [f(x + ph) - f(x)]$

      $\qquad + \dfrac{h}{12} [f'(x + ph) - f'(x)] - \dfrac{h^3}{720} [f^{(3)}(x + ph) - f^{(3)}(x)]$

      $\qquad + \dfrac{h^5}{30{,}240} [f^{(5)}(x + ph) - f^{(5)}(x)] - \cdots,$

called the Euler-Maclaurin summation formula.
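
For polynomials of degree at most six the terms displayed in (22) already give the sum exactly, since the next omitted term involves the seventh derivative. The Python sketch below checks this in rational arithmetic; the coefficient-list representation and the function names are assumptions made for illustration.

```python
from fractions import Fraction

def poly_eval(coeffs, x):
    """Evaluate sum c_i x^i for coefficients listed from degree 0 upward."""
    return sum(c * x**i for i, c in enumerate(coeffs))

def poly_derivative(coeffs):
    return [i * c for i, c in enumerate(coeffs)][1:]

def poly_antiderivative(coeffs):
    return [Fraction(0)] + [Fraction(c) / (i + 1) for i, c in enumerate(coeffs)]

def euler_maclaurin_sum(coeffs, x, h, p):
    """Right-hand side of (22), through the f^(5) term, for a polynomial f."""
    b = x + p * h
    anti = poly_antiderivative(coeffs)
    total = (poly_eval(anti, b) - poly_eval(anti, x)) / h
    total -= Fraction(1, 2) * (poly_eval(coeffs, b) - poly_eval(coeffs, x))
    weights = {1: Fraction(1, 12), 3: Fraction(-1, 720), 5: Fraction(1, 30240)}
    deriv = coeffs
    for order in range(1, 6):
        deriv = poly_derivative(deriv)
        if order in weights:
            jump = poly_eval(deriv, b) - poly_eval(deriv, x)
            total += weights[order] * h**order * jump
    return total

cube = [0, 0, 0, 1]                       # f(x) = x^3
print(euler_maclaurin_sum(cube, 0, 1, 11), sum(j**3 for j in range(11)))
```

Here $x = 0$, $h = 1$, $p = 11$, so both printed numbers are the sum of the cubes $0^3 + 1^3 + \cdots + 10^3$.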


PROBLEMS, SECTION 4 

1. Verify that the factorial polynomials $W_0(x) \equiv 1$, $W_n(x) \equiv x(x - h) \cdots
[x - (n - 1)h]$, $n = 1, 2, \ldots$, satisfy $\Delta W_0(x) = 0$, $\Delta W_{n+1}(x) = h(n + 1)
W_n(x)$, and may be used as a basis for polynomials on which the calculus of
difference operators is applied, e.g., with

    $P_n(x) \equiv \sum_{j=0}^{n} a_j W_j(x),$

    $\Delta P_n(x) = h \sum_{j=1}^{n} j\, a_j W_{j-1}(x).$

2. Verify (19). [Hint: $\{(I + \Delta)^{p-1} + (I + \Delta)^{p-2} + \cdots + (I + \Delta)^q\}\{(I + \Delta) - I\} =
(I + \Delta)^p - (I + \Delta)^q$.]

3. Verify (20). [Hint: Same as in Problem 2.]

4. Use (19) with $p = 2$, $q = 0$ to find Simpson's rule valid for all polynomials
of degree $\le 3$:

    $\int_{x}^{x+2h} f(\xi)\, d\xi = \dfrac{h}{3} [f(x) + 4 f(x + h) + f(x + 2h)].$




5. Use (20) with $f(x) = x^3$, $h = 1$, $q = 0$, $p = n + 1$, $x = 1$ to get a
simple explicit expression for $\sum_{j=1}^{n} j^3$. (Construct a table of differences for
$j = 1, 2, 3, 4$.)

6. Use (22) to derive the formula for $\sum_{j=1}^{n} j^3$.

7.* Prove the Euler-Maclaurin summation formula (22) is correct for polynomials.
[Assume that you can use (21) to formally get an infinite series for
$1/(e^{hD} - I)$.]


5. NUMERICAL DIFFERENTIATION 


A problem of importance in many applications is to approximate the
derivative of a function, being given only several values of the function.
An obvious approach to this problem is to employ the derivative of an
interpolation polynomial as the desired approximation to the derivative
of the function. This can also be done for higher derivatives, but clearly
the approximation must, in general, deteriorate as the order of the derivative
increases. We have seen in Section 3 that the interpolation error factor
is least near the center of the interval of interpolation (for equally spaced
points), and indeed an analogous result is also true for numerical
differentiation.

Denote by $P_n(x)$ the $n$th degree interpolation polynomial for $f(x)$ with
respect to the $n + 1$ distinct points $x_0, x_1, \ldots, x_n$. Then as an approximation
to

    $\dfrac{d^k f(x)}{dx^k} \equiv f^{(k)}(x),$

with $k \le n$, we use $P_n^{(k)}(x)$. However, to assess this approximation we
require some convenient representation for the error:

(1)  $R_n^{(k)}(x) \equiv f^{(k)}(x) - P_n^{(k)}(x).$

If $f^{(n+1)}(x)$ is continuous in the interval, $I_x$, which includes the $x_j$ and $x$,
it has been shown in Theorem 2.1 of Chapter 5 that

    $R_n(x) = \prod_{j=0}^{n} (x - x_j)\, \dfrac{f^{(n+1)}(\xi)}{(n + 1)!},$

where $\xi = \xi(x)$ is an unknown point in $I_x$ for each $x$. It is tempting to
differentiate this expression for $R_n(x)$ in order to obtain $R_n^{(k)}(x)$ but this
is not generally legitimate. First of all, $\xi(x)$ may not be single valued,
let alone differentiable $k$ times and secondly, $f(x)$ may not be $n + 1 + k$
times differentiable. If $f(x)$ does have these differentiability properties,
then another alternative is presented by recalling (1.7) in the form

    $R_n(x) = \prod_{j=0}^{n} (x - x_j)\, f[x_0, \ldots, x_n, x].$

It now follows by an application of Theorem 1.2 that this representation is
$k$ times differentiable. However, the resulting expression is rather complicated
and only useful in the case of first derivatives, $k = 1$, in which
$x = x_i$ is one of the points of interpolation. The error becomes in this
special case

(2)  $f'(x_i) - P_n'(x_i) = R_n'(x_i) = \prod_{j=0,\, j \ne i}^{n} (x_i - x_j)\, f[x_0, x_1, \ldots, x_n, x_i]$

     $\phantom{f'(x_i) - P_n'(x_i) = R_n'(x_i)} = \prod_{j=0,\, j \ne i}^{n} (x_i - x_j)\, \dfrac{f^{(n+1)}(\eta)}{(n + 1)!}.$

The last expression in (2) can be deduced from Theorems 1.2 and 1.1.

To obtain practical error estimates for numerical differentiation in the 
more general case, we return to Rolle’s theorem which was the basis for 
the original interpolation error estimates of Theorem 2.1 in Chapter 5. 
The results may be stated as 


THEOREM 1. Let the interpolation points be ordered by $x_0 < x_1 < \cdots < x_n$.
Let $f^{(n+1)}(x)$ be continuous. Then for each $k \le n$,

(3)  $R_n^{(k)}(x) = \prod_{j=0}^{n-k} (x - \xi_j)\, \dfrac{f^{(n+1)}(\eta)}{(n + 1 - k)!},$

where the $n + 1 - k$ distinct points $\xi_j$ are independent of $x$ and lie in the
intervals

(4)  $x_j < \xi_j < x_{j+k}, \quad j = 0, 1, \ldots, n - k;$

and $\eta = \eta(x)$ is some point in the interval containing $x$ and the $\xi_j$.

Proof. Since $R_n(x) = f(x) - P_n(x)$ has $n + 1$ continuous derivatives
and vanishes at $x = x_j$, $j = 0, 1, \ldots, n$, we may apply Rolle's theorem
$k \le n$ times. In applying this theorem we can keep track of the location
of the implied zeros of the higher derivatives of $R_n(x)$ by means of Table 1.
The $k$th column lists the open intervals, $(x_j, x_{j+k})$, in each of which (by
means of Rolle's theorem) at least one distinct root, $\xi_j$, of $R_n^{(k)}(x)$ must lie.
Thus the points $\xi_j$ of (4) are defined and we note that they depend only
upon the function $f(x)$ and the interpolation points $x_j$ but not upon $x$.

Table 1   Zeros of Higher Derivatives of $R_n(x)$

  $R_n(x)$   $R_n'(x)$            $R_n''(x)$           ...   $R_n^{(k)}(x)$
  $x_0$
  $x_1$      $(x_0, x_1)$
  $x_2$      $(x_1, x_2)$         $(x_0, x_2)$
  $x_3$      $(x_2, x_3)$         $(x_1, x_3)$
  ...
  $x_k$      $(x_{k-1}, x_k)$     $(x_{k-2}, x_k)$     ...   $(x_0, x_k)$
  ...
  $x_n$      $(x_{n-1}, x_n)$     $(x_{n-2}, x_n)$     ...   $(x_{n-k}, x_n)$

We now define the function

    $F(z) \equiv R_n^{(k)}(z) - \alpha \prod_{j=0}^{n-k} (z - \xi_j),$

and note that $F(\xi_j) = 0$ for $j = 0, 1, \ldots, n - k$. For any fixed $x$ distinct
from the $\xi_j$ we pick $\alpha = \alpha(x)$ such that $F(x) = 0$. Then $F(z)$ has $n - k + 2$
distinct zeros and we may apply Rolle's theorem again [noting that $F(z)$
has $n - k + 1$ continuous derivatives]. We deduce that $F^{(n-k+1)}(z)$ has a
zero, say at $\eta$, in the interval containing $x$ and the $\xi_j$. From this result
follows:

    $0 = F^{(n-k+1)}(\eta) = R_n^{(n+1)}(\eta) - \alpha (n - k + 1)!$
    $\phantom{0} = f^{(n+1)}(\eta) - \alpha (n - k + 1)!,$

or

    $\alpha = \dfrac{f^{(n+1)}(\eta)}{(n - k + 1)!}.$

By using this value of $\alpha$ in $F(x) = 0$ the result (3) follows for all $x$. That is,
(3) holds also for $x = \xi_j$ with arbitrary $\eta$ since $F(\xi_j) = 0$ for arbitrary $\alpha$.

The expression (3) for the error in numerical differentiation is valid for
all $x$ and so is of much more general applicability than expressions of the
form (2). Using the known intervals (4) it is possible to obtain bounds on
the error. For instance, if $x$ and the $x_j$ all lie in $[a, b]$ and in this interval
$|f^{(n+1)}(x)| \le M$ then clearly,

    $|R_n^{(k)}(x)| \le \dfrac{M\, |b - a|^{n-k+1}}{(n - k + 1)!}.$

Sharper estimates may be obtained by a more careful use of the inequalities
(4) in bounding the error factor, $\prod_{j=0}^{n-k} (x - \xi_j)$.

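There are other ways to determine numerical differentiation formulae
and their errors. Suppose the value $f^{(k)}(a)$ is to be approximated by using
the values $f(x_i)$, $i = 1, 2, \ldots, m$. With an $f(x)$ that has $n + 1$ continuous
derivatives where $n + 1 \ge m$ we define $h_i \equiv x_i - a$ and use Taylor's
theorem to write

    $f(x_i) = f(a + h_i)$
    $\phantom{f(x_i)} = f(a) + h_i f^{(1)}(a) + \dfrac{h_i^2}{2!} f^{(2)}(a) + \cdots + \dfrac{h_i^n}{n!} f^{(n)}(a) + \dfrac{h_i^{n+1}}{(n + 1)!} f^{(n+1)}(a + \theta_i h_i), \quad i = 1, 2, \ldots, m.$

Here, of course, $0 < \theta_i < 1$. We now form a linear combination of these
equations with weights, $\alpha_i$, to be determined:

(5)  $\sum_{i=1}^{m} \alpha_i f(x_i) = \left( \sum_{i=1}^{m} \alpha_i \right) f(a) + \left( \sum_{i=1}^{m} \alpha_i h_i \right) f^{(1)}(a) + \cdots + \dfrac{1}{n!} \left( \sum_{i=1}^{m} \alpha_i h_i^n \right) f^{(n)}(a)$

     $\qquad + \dfrac{1}{(n + 1)!} \sum_{i=1}^{m} \alpha_i h_i^{n+1} f^{(n+1)}(a + \theta_i h_i).$

We choose the $\alpha_i$ in order that the linear combination of the values $f(x_i)$
be the most accurate approximation to $f^{(k)}(a)$. Thus we impose the $m$
conditions on the $m$ unknowns $\alpha_i$:

(6)  $\sum_{i=1}^{m} \alpha_i h_i^{\nu} = k!\, \delta_{\nu k}, \quad \nu = 0, 1, \ldots, m - 1.$

It is clear that the system (6) has a unique solution since the coefficient
determinant is a Vandermonde determinant. Thus a necessary and sufficient
condition for (6) to have a non-trivial solution is that it be non-homogeneous,
i.e., $m > k$. Hence, in order to approximate a $k$th derivative we
need more than $k$ points. With the solution of the system (6) in (5) we obtain,
recalling that $n + 1 \ge m$,

(7)  $f^{(k)}(a) = \sum_{i=1}^{m} \alpha_i f(x_i) - \dfrac{1}{m!} \left( \sum_{i=1}^{m} \alpha_i h_i^m \right) f^{(m)}(a) - \cdots$

     $\qquad - \dfrac{1}{n!} \left( \sum_{i=1}^{m} \alpha_i h_i^n \right) f^{(n)}(a) - \dfrac{1}{(n + 1)!} \sum_{i=1}^{m} \alpha_i h_i^{n+1} f^{(n+1)}(a + \theta_i h_i), \quad m > k.$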
This procedure is equivalent to what may be called the method of
undetermined coefficients: if we are given $m$ function values, $f(x_i)$, we seek
that linear combination of the values at these points which would give the
exact value of the derivative $f^{(k)}(a)$ for all polynomials of as high a degree
as possible, at the fixed point $a$. Specifically for the first derivative, since

    $\dfrac{d x^{\nu}}{dx} = \nu x^{\nu - 1},$

we seek $\alpha_i$ such that

(8)  $\sum_{i=1}^{m} \alpha_i x_i^{\nu} = \nu a^{\nu - 1}, \quad \nu = 0, 1, \ldots, m - 1.$

This system also has a unique solution and it is, in fact, the same as the
solution of the system (6) with $k = 1$ (assuming that the quantities $x_i$, $h_i$,
and $a$ are related by $h_i = x_i - a$). This verification is posed as Problem 1.
In the present derivation of the approximation formula no estimate of the
error term is obtained but this could be remedied. It should also be
observed that the method of undetermined coefficients can be used to
determine approximations to higher derivatives.
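
The system (6) is small enough to solve exactly. The following Python sketch is one possible illustration, not a prescribed algorithm: it assumes distinct offsets $h_i$ with $m > k$, uses Gauss-Jordan elimination in rational arithmetic, and the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial

def solve_linear_system(matrix, rhs):
    """Gauss-Jordan elimination in exact arithmetic (nonsingular systems only)."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def derivative_weights(offsets, k):
    """Solve system (6): sum_i alpha_i * h_i**nu = k! if nu == k, else 0."""
    m = len(offsets)
    matrix = [[Fraction(hi) ** nu for hi in offsets] for nu in range(m)]
    rhs = [Fraction(factorial(k)) if nu == k else Fraction(0) for nu in range(m)]
    return solve_linear_system(matrix, rhs)

print(derivative_weights([-1, 0, 1], 1))      # weights for f'(a), unit spacing
print(derivative_weights([-1, 0, 1, 2], 2))   # weights for f''(a), unit spacing
```

With unit spacing the first call gives the centered first-derivative weights and the second gives the second-difference weights with a zero weight on the fourth point.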

5.1. Differentiation Using Equidistant Points 

Naturally the numerical differentiation formulae are somewhat simplified
when equally spaced data points are used. For instance, the operator
identity (4.11) yields approximations of the form

(9a)  $f'(x) = \dfrac{1}{h} \left[ \Delta f(x) - \tfrac{1}{2} \Delta^2 f(x) + \cdots + (-1)^{n+1} \tfrac{1}{n} \Delta^n f(x) \right] + R_n'(x).$

Here the data required are $f(x), f(x + h), \ldots, f(x + nh)$, so that this
formula only approximates the derivative at a tabular point and uses
only data on one side of this point. This formula is obtained by differentiating
Newton's forward difference formula (3.12) and evaluating the result
at $t = 0$. Thus the error determined in (2) is applicable and becomes in
this case

(9b)  $R_n'(x) = (-1)^n \dfrac{h^n}{n + 1} f^{(n+1)}(\eta), \quad x < \eta < x + nh.$

Another example is furnished by differentiating the Gaussian form of the
interpolating polynomial, say (3.15a) with $n = 2m$, and again setting $t = 0$,

(10a)  $f'(x_0) = \dfrac{1}{h} \left[ \Delta f(x_0) - \dfrac{1}{2!} \Delta^2 f(x_{-1}) - \dfrac{1}{3!} \Delta^3 f(x_{-1}) + \cdots + (-1)^m \dfrac{(m - 1)!\, m!}{(2m)!} \Delta^{2m} f(x_{-m}) \right] + R_{2m}'(x_0).$

Here the tabular points involved are symmetrically placed about $x_0$ and
again the error formula (2) is valid:

(10b)  $R_{2m}'(x_0) = (-1)^m \dfrac{(m!)^2}{(2m + 1)!} h^{2m} f^{(2m+1)}(\eta), \quad x_0 - mh < \eta < x_0 + mh.$

For $n$ and $m \gg 1$ Stirling's approximation for $n!$ implies, since $n = 2m$,

    $\dfrac{(m!)^2}{(2m + 1)!} \approx \dfrac{n^{-1/2} \sqrt{2\pi}}{2^{n+1}}.$

Thus a comparison of (9b) and (10b) indicates, for differentiation, the
superiority of centering the data points about the point of approximation.
An important special case of (10) occurs for $n = 2$; this can be written as

(11)  $f'(x) = \dfrac{f(x + h) - f(x - h)}{2h} - \dfrac{h^2}{6} f^{(3)}(\eta), \quad x - h < \eta < x + h.$

The approximation formula in (11) is called the centered difference
approximation to the first derivative.
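
The different orders of accuracy in (9b) with $n = 1$ and in (11) are easy to observe numerically. The Python sketch below is illustrative only; the test function $e^x$, the point $x = 0$, and the step sizes are arbitrary choices made for this example.

```python
import math

def forward_difference(f, x, h):
    """One-sided approximation (9a) with n = 1; error is O(h)."""
    return (f(x + h) - f(x)) / h

def centered_difference(f, x, h):
    """Centered approximation (11); error is O(h^2)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def errors(h):
    exact = 1.0                                   # derivative of exp at 0
    return (abs(forward_difference(math.exp, 0.0, h) - exact),
            abs(centered_difference(math.exp, 0.0, h) - exact))

fwd_h, ctr_h = errors(0.1)
fwd_half, ctr_half = errors(0.05)
print(fwd_h / fwd_half, ctr_h / ctr_half)
```

Halving $h$ roughly halves the one-sided error but reduces the centered error by a factor close to four, as the powers of $h$ in the two error terms predict.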

The second derivative, or in fact, any even order derivative, can be
approximated by a centered formula obtained by differentiating the other
Gaussian interpolating polynomial, (3.15b) with $n = 2m + 1$. For example,
with $n = 3$ the approximation of $f''(x_0)$ becomes on setting $x_0 = x$:

(12a)  $f''(x) = \dfrac{\Delta^2 f(x - h)}{h^2} + R_3''(x) = \dfrac{f(x + h) - 2 f(x) + f(x - h)}{h^2} + R_3''(x).$

The error term can now be estimated by Theorem 1:

(12b)  $|R_3''(x)| \le h^2\, |f^{(4)}(\eta)|, \quad x - h < \eta < x + 2h.$

But a better bound for the error can be found by the Taylor expansion
procedure indicated by equations (6) and (7). That is, if we set $k = 2$,
$m = 4$, $a = x$, and $h_i = (i - 2)h$, $i = 1, 2, 3, 4$, we find that

    $\alpha_1 = \alpha_3 = \dfrac{1}{h^2}, \qquad \alpha_2 = -\dfrac{2}{h^2}, \qquad \alpha_4 = 0.$

The error expression, given by the last terms in (7), becomes, upon setting
$n = 3$,

    $R_3''(x) = -\dfrac{h^2}{24} \left[ f^{(4)}(\xi_1) + f^{(4)}(\xi_2) \right], \quad x - h < \xi_1 < x < \xi_2 < x + h.$

But $f^{(4)}(x)$ was assumed continuous in this derivation so for some $\xi$ in
$\xi_1 \le \xi \le \xi_2$ we must have $\tfrac{1}{2} [f^{(4)}(\xi_1) + f^{(4)}(\xi_2)] = f^{(4)}(\xi)$. Thus the error
is

(12c)  $R_3''(x) = -\dfrac{h^2}{12} f^{(4)}(\xi), \quad x - h < \xi < x + h.$

Note the improvement over the bound (12b) both in the factor and the
decreased range of the argument of the fourth derivative. It is also of
interest to observe that the same approximation (12a) and error (12c)
are obtained for $m = 3$ and $n = 3$ in (7) with the above choice of $h_i$;
that is, improving upon the accuracy by using data at one additional point
is not always possible. (See Problem 2.)
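
For $f(x) = x^4$ the fourth derivative is the constant 24, so (12c) predicts an error of exactly $-2h^2$ whatever the value of $\xi$. The short Python check below is an illustration in exact arithmetic; the function name and the sample values of $x$ and $h$ are arbitrary assumptions.

```python
from fractions import Fraction

def second_difference(f, x, h):
    """Centered approximation (12a) to f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

def quartic(x):
    return x**4

x, h = Fraction(3, 2), Fraction(1, 10)
exact_second_derivative = 12 * x**2
error = exact_second_derivative - second_difference(quartic, x, h)
predicted = -(h * h) / 12 * 24            # -(h^2 / 12) * f''''(xi), with f'''' = 24
print(error, predicted)
```

Both printed values are equal, so for this function the error formula (12c) is attained exactly rather than merely bounded.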


PROBLEMS, SECTION 5 

1. Verify that the set of coefficients $\{\alpha_i\}$ produced by solving the system (8)
is the same as the solution $\{\alpha_i\}$ of system (6) for $k = 1$, $h_i = x_i - a$.

2. Verify that if $\{\alpha_i\}$ produces the differentiation formula of maximum
accuracy

    $f^{(k)}(x) \approx \sum_{i=-r}^{r} \alpha_i f(x + ih),$

then

    $f^{(k)}(x) \approx \sum_{i=-r}^{r+1} \beta_i f(x + ih)$

can be no more accurate, and is of the same accuracy only if $\beta_{r+1} = 0$,
$\beta_i = \alpha_i$ for $i = -r, \ldots, +r$. Show also that the coefficients satisfy $\alpha_p = \alpha_{-p}$
if $k$ is even; $\alpha_p = -\alpha_{-p}$ if $k$ is odd.


6. MULTIVARIATE INTERPOLATION 

The problems of polynomial interpolation and approximate differentiation
for functions of several independent variables are important but
the methods are less well developed than in the case of functions of a
single variable. An immediate indication of the difficulties inherent in the
higher dimensional case can be seen in the lack of uniqueness in the general
interpolation problem. That is, we ask if $p_1, p_2, \ldots, p_n$ are $n$ distinct
points, say in the $x, y$-plane, then is there a unique polynomial of specified
degree which attains specified values, say $f(p_j)$, at these points? Clearly the
answer, in general, must be no since if all of the points $[p_j, f(p_j)]$ lie on a
straight line in $x, y, z$-space then there are infinitely many planes (i.e.,
linear polynomials) and perhaps higher degree polynomials of the form
$z = P(x, y)$ containing this line. We shall not dwell on these aspects of
interpolation in higher dimensions but shall show how to construct
appropriate polynomials when the points of interpolation are specially
chosen. It will also be found that in these special cases the interpolation
polynomials are unique. For simplicity, we concentrate on functions of
two variables but extension to more dimensions offers no difficulty.

Let us be given the $(m + 1)(n + 1)$ distinct points $p_{ij} = (x_i, y_j)$;
$i = 0, 1, \ldots, m$, $j = 0, 1, \ldots, n$ and corresponding function values $f(x_i, y_j)$.
These points form a rectangular array which is the set of intersections
of the vertical lines $x = x_i$ with the horizontal lines $y = y_j$ in the $x, y$-plane.
We seek a polynomial, $P(x, y)$, of degree at most $m$ in $x$ and at most
$n$ in $y$ such that

    $P(x_i, y_j) = f(x_i, y_j), \quad i = 0, 1, \ldots, m, \quad j = 0, 1, \ldots, n.$

Such a polynomial must have the form

(1)    $P(x, y) = \sum_{i=0}^{m} \sum_{j=0}^{n} a_{ij} x^i y^j$

with $(m + 1)(n + 1)$ coefficients, $a_{ij}$, to be determined. This problem is
easily solved, due to the special form of the points $p_{ij}$, with the use of the
Lagrange interpolation coefficients. Let us write the Lagrange coefficients
for the points $\{x_i\}$ and $\{y_j\}$ as

(2)    $X_{m,i}(x) = \prod_{k=0,\, k \ne i}^{m} \frac{x - x_k}{x_i - x_k}$,    $i = 0, 1, \ldots, m$;

       $Y_{n,j}(y) = \prod_{k=0,\, k \ne j}^{n} \frac{y - y_k}{y_j - y_k}$,    $j = 0, 1, \ldots, n$.

Then clearly, the polynomial $X_{m,i}(x) Y_{n,j}(y)$ is of degree $m$ in $x$, of degree
$n$ in $y$ and vanishes when $(x, y) = p_{\nu\mu}$ unless $\nu = i$ and $\mu = j$, in which
case it is unity. Thus the required polynomial satisfying $P(x_i, y_j) = f(x_i, y_j)$
can be written as

(3)    $P(x, y) = \sum_{i=0}^{m} \sum_{j=0}^{n} X_{m,i}(x) Y_{n,j}(y) f(x_i, y_j)$.

Since the number of coefficients in the general polynomial (1) of degree
$m$ in $x$ and $n$ in $y$ is equal to the number of conditions imposed we may
expect that the interpolation polynomial (3) is unique. A formal proof
of this fact is indicated in Problem 1. The extension to more independent
variables is obvious. 
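
The product form (3) can be tried out directly. The following is a minimal sketch, not part of the original text: the function names and the sample polynomial are illustrative assumptions, and exact rational arithmetic is used so that the reproduction of a polynomial of degree $m$ in $x$ and $n$ in $y$ can be checked without rounding error.

```python
from fractions import Fraction


def lagrange_basis(nodes, i, t):
    """Lagrange coefficient for node i of `nodes`, evaluated at t, as in (2)."""
    value = Fraction(1)
    for k, node in enumerate(nodes):
        if k != i:
            value *= Fraction(t - node) / Fraction(nodes[i] - node)
    return value


def interpolate_2d(xs, ys, f, x, y):
    """Evaluate the product-form interpolation polynomial (3) at (x, y)."""
    total = Fraction(0)
    for i, xi in enumerate(xs):
        for j, yj in enumerate(ys):
            total += lagrange_basis(xs, i, x) * lagrange_basis(ys, j, y) * f(xi, yj)
    return total


# Hypothetical test function: degree 2 in x and degree 1 in y.
def sample_f(x, y):
    return 3 * x * x * y - 2 * x + y + 5


xs = [Fraction(0), Fraction(1), Fraction(3)]   # m = 2
ys = [Fraction(-1), Fraction(2)]               # n = 1
value = interpolate_2d(xs, ys, sample_f, Fraction(1, 2), Fraction(1, 3))
```

Since `sample_f` has degree 2 in $x$ and 1 in $y$, uniqueness implies that `value` agrees with `sample_f` at the (non-nodal) evaluation point.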

Another representation of the interpolation polynomial (3) can be 
obtained by using Newton’s divided difference formulae. With the $m + 1$
distinct points $x_i$ we have, recalling (1.7) and (1.8),

(4)    $f(x, y) = \sum_{k=0}^{m} \omega_{k-1}(x) f[x_0, x_1, \ldots, x_k; y] + \omega_m(x) f[x_0, \ldots, x_m, x; y]$.

Here we have introduced

       $\omega_{-1}(x) \equiv 1$;    $\omega_k(x) = \omega_{k-1}(x)(x - x_k)$,    $k = 0, 1, \ldots$;

and the divided differences of a function of several variables are formed
by keeping all but one variable fixed and taking the indicated differences
with respect to the free variable. Hence $f[x_0, x_1, \ldots, x_k; y]$ as a function
of the independent variable $y$ has the Newton representation, using the
$n + 1$ points $y_j$:

(5)    $f[x_0, x_1, \ldots, x_k; y] = \sum_{j=0}^{n} \omega_{j-1}(y) f[x_0, x_1, \ldots, x_k; y_0, y_1, \ldots, y_j] + \omega_n(y) f[x_0, \ldots, x_k; y_0, \ldots, y_n, y]$.

We use (5) for $k = 0, 1, \ldots, m$ in (4) to obtain

       $f(x, y) = Q(x, y) + R(x, y)$

where

(6)    $Q(x, y) = \sum_{k=0}^{m} \sum_{j=0}^{n} \omega_{k-1}(x) \omega_{j-1}(y) f[x_0, \ldots, x_k; y_0, \ldots, y_j]$

and

(7a)   $R(x, y) = \omega_n(y) \sum_{k=0}^{m} \omega_{k-1}(x) f[x_0, \ldots, x_k; y_0, \ldots, y_n, y] + \omega_m(x) f[x_0, \ldots, x_m, x; y]$.

It is clear that $R(x_i, y_j) = 0$ at the $(m + 1)(n + 1)$ points $(x_i, y_j)$ and
hence by the uniqueness proof mentioned we can conclude that $P(x, y) \equiv
Q(x, y)$. The derivation of the interpolation polynomial also yields an
expression for the interpolation error, $R(x, y)$. To simplify this expression
we again use Newton’s formula and the $m + 1$ points $x_i$ to write

       $f[x; y_0, \ldots, y_n, y] = \sum_{i=0}^{m} \omega_{i-1}(x) f[x_0, \ldots, x_i; y_0, \ldots, y_n, y] + \omega_m(x) f[x_0, \ldots, x_m, x; y_0, \ldots, y_n, y]$.

If we multiply this identity by $\omega_n(y)$ and subtract the result from (7a) we
obtain finally,

(7b)   $R(x, y) = \omega_m(x) f[x_0, \ldots, x_m, x; y] + \omega_n(y) f[x; y_0, \ldots, y_n, y] - \omega_m(x) \omega_n(y) f[x_0, \ldots, x_m, x; y_0, \ldots, y_n, y]$.

If $f(x, y)$ has continuous partial derivatives of orders $m + 1$ and $n + 1$,
respectively, in $x$ and $y$ and the appropriate mixed derivative of order
$m + n + 2$ then by applying the obvious extension of Theorem 1.1 the
error becomes

(7c)   $R(x, y) = \frac{\omega_m(x)}{(m+1)!} \frac{\partial^{m+1} f(\xi, y)}{\partial x^{m+1}} + \frac{\omega_n(y)}{(n+1)!} \frac{\partial^{n+1} f(x, \eta)}{\partial y^{n+1}} - \frac{\omega_m(x) \omega_n(y)}{(m+1)!\,(n+1)!} \frac{\partial^{m+n+2} f(\xi', \eta')}{\partial x^{m+1} \partial y^{n+1}}$.


This error formula is not of the form of the two-dimensional Taylor series 
error term, as was the case in one dimension, since different orders of 
differentiation occur here. 

By specializing the interpolation points to be equally spaced we can 
obtain special forms of (3) and (6). These forms may be written in terms 
of the difference operators of Section 4, generalized so that they operate 
with respect to a particular independent variable. An example of such a 
representation is to be found in Problem 2. 

The interpolation problem solved above, by (3) or (6), does not specify
the degree of the polynomial in question but rather the maximum degree
in $x$ and $y$, separately. If a polynomial in two variables is to have total
degree at most $n$, say, then it must have the form

(8)    $P_n(x, y) = \sum_{k=0}^{n} \sum_{j=0}^{n-k} b_{kj} x^k y^j$.

We note that the coefficients $b_{kj}$ can be naturally arranged in a triangular
array of $\frac{1}{2}(n + 1)(n + 2)$ numbers. [In contrast, the $a_{ij}$ in (1) formed a
rectangular array of $(m + 1)(n + 1)$ quantities.] We shall show that with
an appropriate “triangular” array of points, $(x_k, y_j)$, the interpolation
problem for polynomials of the form (8) can be uniquely solved.

Let $\{x_k\}$ and $\{y_j\}$ be two sets of $n + 1$ distinct points, where
$j, k = 0, 1, \ldots, n$. Then we consider the array of points:

(9)    $p_{kj} = (x_k, y_j)$,    $j + k = 0, 1, \ldots, n$.

[This array is actually triangular only if the values $x_j$ and $y_j$ are ordered by
$j$ and uniformly spaced, which we do not assume.] There are $\frac{1}{2}(n + 1)(n + 2)$
such points and with them we pose the interpolation problem: find a
polynomial in $x$ and $y$ of degree at most $n$ such that $P_n(x_k, y_j) = f(x_k, y_j)$
for $0 \le j + k \le n$. Newton’s divided difference formulae easily yield the
solution of this problem. We obtain, upon replacing $m$ by $n$ in (4) and $n$
by $n - k$ in (5):


       $f(x, y) = P_n(x, y) + R_n(x, y)$

where

(10)   $P_n(x, y) = \sum_{k=0}^{n} \sum_{j=0}^{n-k} \omega_{k-1}(x) \omega_{j-1}(y) f[x_0, \ldots, x_k; y_0, \ldots, y_j]$

and

(11)   $R_n(x, y) = \sum_{k=0}^{n} \omega_{k-1}(x) \omega_{n-k}(y) f[x_0, \ldots, x_k; y_0, \ldots, y_{n-k}, y] + \omega_n(x) f[x_0, \ldots, x_n, x; y]$

       $= \sum_{k=0}^{n+1} \frac{\omega_{k-1}(x) \omega_{n-k}(y)}{k!\,(n - k + 1)!} \left( \frac{\partial}{\partial x} \right)^{k} \left( \frac{\partial}{\partial y} \right)^{n-k+1} f(\xi_k, \eta_k)$.


The polynomial (10) has degree at most $n$. If we assume the indicated
partial derivatives of $f(x, y)$ to be continuous, then (11) yields the error
which vanishes at all points in (9). The uniqueness of this polynomial
follows from Problem 3. Thus the interpolation problem is solved by a
polynomial of the form (8) on any set of points of the form (9). If these
points are equally spaced and in monotone order, then the polynomial
(10) can be simplified by introducing difference operators (see Problem 4).
The remainder term in (11) is now analogous to that in Taylor’s formula.
In fact, if we let $x_\nu \to x_0$ and $y_\nu \to y_0$ for $\nu = 1, 2, 3, \ldots, n$ then (10)
formally goes over into the Taylor expansion.
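
The construction of (10) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors’ program: the function names are hypothetical, the table `D[k][j]` is assumed to hold $f[x_0, \ldots, x_k; y_0, \ldots, y_j]$ for $j + k \le n$, and the differences are taken first in $x$ and then in $y$.

```python
from fractions import Fraction


def triangular_divided_differences(xs, ys, f):
    """Return D with D[k][j] = f[x_0..x_k; y_0..y_j] for j + k <= n."""
    n = len(xs) - 1
    # Start from the function values on the triangular array (9).
    D = [[f(xs[k], ys[j]) for j in range(n - k + 1)] for k in range(n + 1)]
    # Divided differences with respect to x, for each fixed j.
    for j in range(n + 1):
        for order in range(1, n - j + 1):
            for k in range(n - j, order - 1, -1):
                D[k][j] = (D[k][j] - D[k - 1][j]) / (xs[k] - xs[k - order])
    # Divided differences with respect to y, for each fixed k.
    for k in range(n + 1):
        for order in range(1, n - k + 1):
            for j in range(n - k, order - 1, -1):
                D[k][j] = (D[k][j] - D[k][j - 1]) / (ys[j] - ys[j - order])
    return D


def evaluate_triangular(xs, ys, D, x, y):
    """Evaluate the polynomial (10) at (x, y) from the table D."""
    n = len(xs) - 1
    total = 0
    wx = 1                      # omega_{k-1}(x), starting with omega_{-1} = 1
    for k in range(n + 1):
        wy = 1                  # omega_{j-1}(y)
        for j in range(n - k + 1):
            total += wx * wy * D[k][j]
            wy *= (y - ys[j])
        wx *= (x - xs[k])
    return total


# Hypothetical check: a polynomial of total degree 2 with n = 2.
def sample_g(x, y):
    return x * x + 3 * x * y - y * y + 2


xs = [Fraction(0), Fraction(1), Fraction(3)]
ys = [Fraction(-1), Fraction(2), Fraction(5)]
D = triangular_divided_differences(xs, ys, sample_g)
```

Because all derivatives of total order $n + 1$ of `sample_g` vanish, the remainder (11) is zero and the evaluated polynomial should coincide with `sample_g` everywhere.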

To approximate the partial derivatives of functions of several independent
variables we could proceed as in Section 5 for functions of one variable.
By using the error expressions of the form (7b) we could also obtain
representations for the error in these numerical differentiation methods
(if the function is sufficiently differentiable). However, in practice it
turns out that relatively low order approximations to partial derivatives are
usually all that are required. In these circumstances, it is easy to use the
method of undetermined coefficients or the Taylor expansion method
developed in Section 5. If no mixed derivatives occur and the points to
be used are on a coordinate line in the direction of differentiation then the
one-dimensional analysis is valid. For mixed derivatives the points employed
in the expansion procedure must not be collinear. In Chapter 9,
where partial differential equations are treated, specific applications are
made in several examples.
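
As an illustration of a low order approximation to a mixed derivative, here is a minimal sketch of the familiar four-point central difference for $\partial^2 f / \partial x\, \partial y$. It is not taken from the text above; the function name and the step arguments `h`, `k` are assumptions. The four points used are the corners of a rectangle and hence are not collinear.

```python
def mixed_partial(f, x, y, h, k):
    """Four-point central-difference estimate of d^2 f / (dx dy) at (x, y).

    Uses the corners (x +- h, y +- k); Taylor expansion shows the error is
    of second order in h and k for sufficiently smooth f.
    """
    return (f(x + h, y + k) - f(x + h, y - k)
            - f(x - h, y + k) + f(x - h, y - k)) / (4 * h * k)
```

For $f(x, y) = x^2 y^2$ the estimate is exact, since the numerator factors as $(4xh)(4yk)$.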


PROBLEMS, SECTION 6 

1. Show that every polynomial $Q(x, y)$ of degree $m$ in $x$ and $n$ in $y$ which
vanishes at the $(m + 1)(n + 1)$ distinct points $(x_i, y_j)$; $i = 0, 1, \ldots, m$;
$j = 0, 1, \ldots, n$; vanishes identically.

[Hint:
       $Q(x, y) = \sum_{\nu=0}^{m} \sum_{\mu=0}^{n} a_{\nu\mu} x^\nu y^\mu = \sum_{\nu=0}^{m} b_\nu(y) x^\nu$.

Set $y = y_j$ and then note that the polynomial $Q(x, y_j)$ of degree $m$ in $x$ vanishes
at $m + 1$ distinct points. Thus, $b_\nu(y_j) = 0$ for $\nu = 0, 1, \ldots, m$. Next show that
all $a_{\nu\mu} = 0$.]

2. We define the difference operators $\Delta_x$ and $\Delta_y$ by:

       $\Delta_x f(x, y) = f(x + h, y) - f(x, y)$,

       $\Delta_y f(x, y) = f(x, y + k) - f(x, y)$.

If $x_i = x_0 + ih$ and $y_j = y_0 + jk$ then show that the interpolation polynomial
of degree $m$ in $x$ and $n$ in $y$ for $f(x, y)$ using the points $(x_i, y_j)$; $i = 0, 1, \ldots, m$;
$j = 0, 1, \ldots, n$ is:

       $P(x_0 + sh, y_0 + tk) = \sum_{\nu=0}^{m} \sum_{\mu=0}^{n} \binom{s}{\nu} \binom{t}{\mu} \Delta_x^\nu \Delta_y^\mu f(x_0, y_0)$.

3. State and prove the analog of Problem 1 for polynomials in $x$ and $y$ of
degree at most $n$ using points of the form (9).

4. Use the difference operators of Problem 2 and special equally spaced
points of the form (9) to derive from (10):

       $P_n(x_0 + sh, y_0 + tk) = \sum_{\nu=0}^{n} \sum_{\mu=0}^{n-\nu} \binom{s}{\nu} \binom{t}{\mu} \Delta_x^\nu \Delta_y^\mu f(x_0, y_0)$.

Define the corresponding backward difference operators $\nabla_x$ and $\nabla_y$; use them
to write an interpolation polynomial of degree $n$ in the plane; describe the set
of interpolation points employed.




7 

Numerical Integration 


0. INTRODUCTION 

Simple explicit formulae cannot be given for the indefinite integrals of 
most functions. Furthermore, in many problems the integrand, $f(x)$, is not
known precisely but perhaps is given by tabular data or defined as the
solution of some differential equation (which cannot be solved explicitly).
Thus, we seek appropriate numerical procedures to approximate the value
of the definite integral, say

(1)    $I\{f\} \equiv \int_a^b f(x)\,dx$.

Unless otherwise specified $[a, b]$ is a finite closed interval.

The types of approximation to (1) that we shall consider are all essentially
of the form

(2)    $I_n\{f\} \equiv \sum_{j=1}^{n} \alpha_j f(x_j)$.

When employed as an approximation to an integral, a sum of this form is
called a numerical quadrature or numerical integration formula. For
brevity, “numerical” is usually dropped. The $n$ distinct points, $x_j$, are
called the quadrature points or nodes and the quantities $\alpha_j$ are called the
quadrature coefficients. The basic problems in numerical integration are
concerned with choosing the nodes and coefficients so that $I_n\{f\}$ will be a
“close” approximation to $I\{f\}$ for a large class of functions, $f(x)$. As with
polynomial approximation we note that different criteria may be used to
measure the quadrature error,

       $E_n\{f\} \equiv I\{f\} - I_n\{f\}$,

even though it is a scalar; these criteria suggest different types of quadrature
formulae.
One particularly useful notion which measures the error of a quadrature
formula is its so-called degree of precision; this is by definition the maximum
integer $m$ such that $E_n\{x^k\} = 0$ for $k = 0, 1, \ldots, m$ but $E_n\{x^{m+1}\} \ne 0$.
Thus, if a formula has degree of precision $m$ all polynomials of degree at
most $m$ are integrated exactly by that formula.
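
The definition can be checked mechanically for any given formula of the form (2). The sketch below is an illustration only; the function name and the cutoff `max_degree` are assumptions. It tests the monomials $x^k$ in exact rational arithmetic and returns the largest $m$ for which $E_n\{x^k\} = 0$, $k = 0, 1, \ldots, m$.

```python
from fractions import Fraction


def degree_of_precision(nodes, weights, a, b, max_degree=20):
    """Largest m with E{x^k} = 0 for k = 0..m (searched up to max_degree).

    Returns -1 if even constants are not integrated exactly.
    """
    a, b = Fraction(a), Fraction(b)
    m = -1
    for k in range(max_degree + 1):
        exact = (b ** (k + 1) - a ** (k + 1)) / (k + 1)      # integral of x^k
        approx = sum(Fraction(w) * Fraction(x) ** k
                     for x, w in zip(nodes, weights))
        if exact != approx:
            break
        m = k
    return m


half = Fraction(1, 2)
# Two familiar formulas on [0, 1], used here as test cases.
trapezoid = degree_of_precision([0, 1], [half, half], 0, 1)
simpson = degree_of_precision([0, half, 1],
                              [Fraction(1, 6), Fraction(2, 3), Fraction(1, 6)],
                              0, 1)
```

On these inputs the trapezoidal weights give degree of precision 1 and the Simpson weights give 3.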

In fact, an expression for the error $E_n\{f\}$ of such a scheme is given in

THEOREM 1. If (2) has degree of precision $m$ and $f(x)$ has a continuous
derivative of order $m + 1$, then

(3)    $E_n\{f\} \equiv I\{f\} - I_n\{f\} = \frac{1}{(m+1)!} \int_c^d f^{(m+1)}(\xi)\, G_{n,m}(\xi)\,d\xi$,

where

       $G_{n,m}(\xi) \equiv (m + 1) \left[ I\{(x - \xi)_+^m\} - I_n\{(x - \xi)_+^m\} \right]$

with

       $(x - \xi)_+^m \equiv (x - \xi)^m$ if $x \ge \xi$,    $(x - \xi)_+^m \equiv 0$ if $x < \xi$,

and $[c, d]$ is the smallest interval containing $[a, b]$ and all $x_j$.

Proof. By Taylor’s theorem (with remainder)

       $f(x) = T_m(x) + R_m(x)$,

where

       $T_m(x) = \sum_{k=0}^{m} \frac{f^{(k)}(c)}{k!} (x - c)^k$,    $R_m(x) = \frac{1}{m!} \int_c^x f^{(m+1)}(\xi) (x - \xi)^m\,d\xi$.

Clearly,

(4)    $R_m(x) = \frac{1}{m!} \int_c^d f^{(m+1)}(\xi) (x - \xi)_+^m\,d\xi$,    $c \le x \le d$.

$I\{\ \}$ and $I_n\{\ \}$ are linear operators.† Hence, since $I_n\{\ \}$ has degree of
precision $m$, $I\{T_m\} = I_n\{T_m\}$ and

       $E_n\{f\} = I\{R_m\} - I_n\{R_m\}$.

But for $R_m(x)$ as given in (4), we find the expression (3), by interchanging
the order of the operations on the variables $x$ and $\xi$. ■

† $J\{\ \}$ is called a linear operator iff for all scalars $a$, $b$ and functions $f(x)$, $g(x)$

       $J\{a f(x) + b g(x)\} = a J\{f\} + b J\{g\}$.
It is left to Problem 1 to show that $G_{n,m}(c) = G_{n,m}(d) = 0$. In the following
sections, simpler expressions than (3) for the error are found in special
cases.

If a “close” approximation to $f(x)$ in $a \le x \le b$ is known, then the
integral of the approximating function will be “close” to the integral of
$f(x)$. That is, if

       $|f(x) - g(x)| \le \epsilon$

then

       $\left| \int_a^b f(x)\,dx - \int_a^b g(x)\,dx \right| \le |b - a|\,\epsilon$.


This simple result is the motivation for developing most numerical 
integration methods. Of course, it is desirable that the approximating 
function should have a simple explicit indefinite integral. Hence polynomial 
approximations are naturally suggested and of these the interpolation 
polynomials are most frequently employed. Although there are quadrature 
formulae of great utility which are not necessarily motivated by the use of 
simple interpolation polynomials, we shall see nevertheless that all such 
methods of general value are what we will call interpolatory. There is
considerable freedom to choose the position of the interpolation points relative
to the interval of integration, so as may be expected there are a large
number of numerical integration formulae. The choice of which formula 
to employ in a given case should depend upon its accuracy and relative 
ease of application. 

In Sections 1 through 4, we consider simple quadrature formulae; in 
Section 5 we treat composite quadrature formulae. A composite formula is 
obtained by applying a simple formula to successive subintervals of 
$[a, b]$. In this fashion the problem of uniformly approximating the
integrand $f(x)$ over $[a, b]$ is treated by using polynomials of a fixed “low”
degree over each of the “small” subintervals into which the interval $[a, b]$
is divided.

In many integration problems the integrand cannot be accurately 
approximated by a polynomial. Such cases may arise, for example, if 
$f(x)$ is discontinuous at some points of the interval. Special considerations
are required in these problems and we study some of them in Section 6. 

In Section 7, we briefly treat the subject of approximating multiple 
integrals where the current state of the theory is not fully developed. 


PROBLEMS, SECTION 0 


1. Under the conditions of Theorem 1, show that 
       $G_{n,m}(c) = G_{n,m}(d) = 0$.

Let $n + 1$ distinct points $x_j$ be ordered by

       $x_0 < x_1 < \cdots < x_n$.

With these points as interpolation points, we form the interpolation
polynomial $P_n(x)$ of degree at most $n$ [for the continuous function $f(x)$] such
that $f(x_j) = P_n(x_j)$, $j = 0, 1, \ldots, n$. Then as an approximation to the
integral (0.1), we set

(1)    $I_{n+1}\{f\} = \int_a^b P_n(x)\,dx$.

This integral is easily evaluated. In fact, by using the Lagrange form for the
interpolation polynomial

(2a)   $P_n(x) = \sum_{j=0}^{n} \phi_{n,j}(x) f(x_j)$,

(2b)   $\phi_{n,j}(x) = \frac{\omega_n(x)}{(x - x_j)\, \omega_n'(x_j)}$,    $j = 0, 1, \ldots, n$,

where $\omega_n(x) = (x - x_0)(x - x_1) \cdots (x - x_n)$, we obtain from (1) the
quadrature formula

(3a)   $I_{n+1}\{f\} = \sum_{j=0}^{n} w_{n,j} f(x_j)$,

with the coefficients given by

(3b)   $w_{n,j} = \int_a^b \phi_{n,j}(x)\,dx$.

It is clear that the coefficients $w_{n,j}$ are determined completely by the
endpoints of the interval of integration and by the interpolation points $x_j$,
which also are the nodes of the formula (3a); the coefficients are independent
of the integrand. Any quadrature formula of the form (3a and b)
is called an interpolatory quadrature formula.
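
Because the weights depend only on the nodes and the endpoints, they can be computed once and reused for any integrand. The following sketch is an illustration under stated assumptions (hypothetical function names; each Lagrange polynomial is expanded in powers of $x$ and integrated term by term in exact rational arithmetic); it is not a recommended numerical procedure for large $n$.

```python
from fractions import Fraction


def poly_mul_linear(coeffs, root):
    """Multiply a polynomial (ascending coefficients) by (x - root)."""
    result = [Fraction(0)] * (len(coeffs) + 1)
    for power, c in enumerate(coeffs):
        result[power + 1] += c
        result[power] -= root * c
    return result


def interpolatory_weights(nodes, a, b):
    """Weights w_j = integral over [a, b] of the j-th Lagrange polynomial."""
    a, b = Fraction(a), Fraction(b)
    weights = []
    for j, xj in enumerate(nodes):
        coeffs = [Fraction(1)]
        for k, xk in enumerate(nodes):
            if k != j:
                coeffs = poly_mul_linear(coeffs, Fraction(xk))
                coeffs = [c / (Fraction(xj) - Fraction(xk)) for c in coeffs]
        # Integrate sum_p c_p x^p term by term over [a, b].
        weights.append(sum(c * (b ** (p + 1) - a ** (p + 1)) / (p + 1)
                           for p, c in enumerate(coeffs)))
    return weights


# Nodes inside the interval: the three-point formula on [0, 1].
inside = interpolatory_weights([0, Fraction(1, 2), 1], 0, 1)
# Nodes outside the open interval (1, 2): an extrapolatory formula.
outside = interpolatory_weights([0, 1], 1, 2)
```

The second call shows that the nodes need not lie in the interval of integration; the resulting weights are then no longer all positive.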

The error in approximating the continuous function $f(x)$ by $P_n(x)$ is,
by (1.8) of Chapter 6 and the definition of $\omega_n(x)$ above,

       $f(x) - P_n(x) = \omega_n(x) f[x_0, \ldots, x_n, x]$.

Integrate this equation over $[a, b]$ and use (0.1) and (1) to obtain the
interpolatory quadrature error

(4a)   $E_{n+1}\{f\} \equiv I\{f\} - I_{n+1}\{f\} = \int_a^b \omega_n(x) f[x_0, \ldots, x_n, x]\,dx$.
If $f(x)$ is a polynomial of degree $n$ or less then $E_{n+1}\{f\} = 0$ follows from
the corollary to Theorem 1.1 of Chapter 6. Thus we have shown that any
interpolatory quadrature formula using $n + 1$ nodes has degree of precision
at least $n$. We shall, in fact, see later that even higher degrees of precision
are possible if the nodes are specially placed with respect to the interval of
integration. From (4a) a simple error bound is found,

(4b)   $|E_{n+1}\{f\}| \le \max_{x \in [a, b]} |f[x_0, x_1, \ldots, x_n, x]| \int_a^b |\omega_n(x)|\,dx$,

where without loss of generality we assume $a < b$.

To examine the error in more detail let us consider the special case in
which $\omega_n(x)$ does not vanish in the open interval $(a, b)$ [i.e., $\omega_n(x)$ does not
change sign there]:

If $f'(x)$ is continuous on $[a, b]$, it follows that $f[x_0, \ldots, x_n, x]$ is continuous
on $[a, b]$ by Theorem 1.3 of Chapter 6. Thus the mean value theorem for
integrals can be employed in (4a) to yield

(5a)   $E_{n+1}\{f\} = f[x_0, \ldots, x_n, \eta] \int_a^b \omega_n(x)\,dx$,    $a < \eta < b$.

If, in addition, $f^{(n+1)}(x)$ is continuous on the smallest closed interval,
$[c, d]$, containing $[a, b]$ and the nodes $\{x_j\}$, then by (1.9) of Chapter 6,
(4a) becomes

(5b)   $E_{n+1}\{f\} = \frac{1}{(n+1)!} \int_a^b \omega_n(x) f^{(n+1)}(\xi)\,dx$,    $\xi \equiv \xi(x) \in (c, d)$.

Now apply the same mean value theorem to (5b), where $\omega_n(x)$ is of one
sign and $f^{(n+1)}(\xi(x))$ is continuous in $x$ because

       $f^{(n+1)}(\xi(x)) = (n + 1)!\, f[x_0, x_1, \ldots, x_n, x]$,

to find

(5c)   $E_{n+1}\{f\} = \frac{f^{(n+1)}(\xi)}{(n+1)!} \int_a^b \omega_n(x)\,dx$,    $\xi \in (c, d)$.

In the general case, some nodes will lie in the interval of integration and 
the simple error formula (5c) is not generally valid. Our aim is to find a 
suitable replacement for (5c). 

Specifically let there be $r - 1 \ge 0$ interpolation points or nodes in the
open interval $(a, b)$, as in Figure 1,

       $x_0 < \cdots < x_i \le a < x_{i+1} < \cdots < x_{i+r-1} < b \le x_{i+r} < \cdots < x_n$.

For convenience of notation we introduce the $r + 1$ quantities,

       $\xi_0 \equiv a$,    $\xi_k \equiv x_{i+k}$,  $k = 1, 2, \ldots, r - 1$,    $\xi_r \equiv b$.
[Figure 1: the nodes $x_0, x_1, \ldots, x_i, x_{i+1}, \ldots, x_{i+r-1}, x_{i+r}, \ldots, x_n$ marked along the $x$-axis.]

Now the error expression in (4a) can be written as

       $E_{n+1}\{f\} = \int_{\xi_0}^{\xi_r} \omega_n(x) f[x_0, \ldots, x_n, x]\,dx = \sum_{k=1}^{r} \int_{\xi_{k-1}}^{\xi_k} \omega_n(x) f[x_0, \ldots, x_n, x]\,dx$.

In each of the intervals $[\xi_{k-1}, \xi_k]$ the quantity $\omega_n(x)$ is of one sign and it
changes sign at the points $\xi_1, \ldots, \xi_{r-1}$. Thus as in the derivation of (5a)
we now conclude that

(6a)   $E_{n+1}\{f\} = \sum_{k=1}^{r} C_k f[x_0, \ldots, x_n, \eta_k]$,    $\xi_{k-1} \le \eta_k \le \xi_k$,

where

(6b)   $C_k \equiv \int_{\xi_{k-1}}^{\xi_k} \omega_n(x)\,dx$,    $k = 1, 2, \ldots, r$.

If $f^{(n+1)}(x)$ is continuous in $[x_0, x_n]$ then, as in (5c), we obtain

       $E_{n+1}\{f\} = \sum_{k=1}^{r} \frac{C_k}{(n+1)!} f^{(n+1)}(\zeta_k)$,    $x_0 < \zeta_k < x_n$.

However, it is clear from (6b) that the constants $C_k$ alternate in sign;
that is,

       sign $C_{k+1}$ = sign $[-C_k]$,    $k = 1, 2, \ldots, r - 1$.

So the last form for the error can be written as two sums, each with
coefficients of the same sign:

(6c)   $E_{n+1}\{f\} = \frac{1}{(n+1)!} \Big[ \{ C_1 f^{(n+1)}(\zeta_1) + C_3 f^{(n+1)}(\zeta_3) + \cdots \} + \{ C_2 f^{(n+1)}(\zeta_2) + C_4 f^{(n+1)}(\zeta_4) + \cdots \} \Big]$.

To simplify further, we require the following: 

LEMMA 1. Let $g(x)$ be a continuous function in $[a, b]$ and let $\alpha_1, \alpha_2, \ldots, \alpha_n$
be any set of non-negative numbers such that

       $\sum_{k=1}^{n} \alpha_k = A$.

Then for each set of $n$ points $x_k \in [a, b]$ there exists a $\xi \in [a, b]$ such that

       $\sum_{k=1}^{n} \alpha_k g(x_k) = A g(\xi)$.

Proof. Since $g(x)$ is continuous on a closed interval, it has there a
finite maximum, $M$, a finite minimum, $m$, and actually takes on these
values and all intermediate values as $x$ ranges over $[a, b]$. Thus for each
of the $x_k$

       $m \le g(x_k) \le M$,    $k = 1, 2, \ldots, n$.

Since the numbers $\alpha_k$ are non-negative, this implies

       $\alpha_k m \le \alpha_k g(x_k) \le \alpha_k M$.

Sum these inequalities for $k = 1, 2, \ldots, n$ to find

       $A m \le \sum_{k=1}^{n} \alpha_k g(x_k) \le A M$.

Hence the value of the sum must be equal to $A g(\xi)$ for some $\xi \in [a, b]$. ■

It should be observed that this lemma is analogous to the mean value
theorem for integrals and that this proof copies the usual proof of that
theorem.

Returning to (6c) we now have, by an obvious application of Lemma 1
to each of the sums in brackets

(7a)   $E_{n+1}\{f\} = \frac{1}{(n+1)!} \left[ K_o f^{(n+1)}(\xi_o) - K_e f^{(n+1)}(\xi_e) \right]$,

where

(7b)   $K_o \equiv C_1 + C_3 + \cdots$,    $K_e \equiv -C_2 - C_4 - \cdots$,    $x_0 \le \xi_o, \xi_e \le x_n$.

The constants $K_e$ and $K_o$ have the same sign and so some cancellation is
suggested by (7a). In fact, since

       $|K_o - K_e| = \left| \sum_{j=1}^{r} C_j \right| = \left| \int_a^b \omega_n(x)\,dx \right|$,

the error expression (7a) formally reduces to (5c) if $\xi_o = \xi_e = \xi$. In general,
$\xi_o$ and $\xi_e$ are unknown points and $f^{(n+1)}(x)$ may change sign in $(x_0, x_n)$
so that the above reduction in the error may not occur. However, there are
important special cases where, in fact, this maximum suggested cancellation
does occur. We shall consider them later (see Theorem 2).

If the interpolation points or nodes are equally spaced, the above results 
can be modified to exhibit the dependence of the error on the spacing of
the points. That is, let the interpolation points be of the form

       $x_j = x_0 + jh$,    $j = 0, 1, \ldots, n$,

and introduce the change of variable from $x$ to $t$,

       $x = x_0 + th$.

Now, from (3.10) of Chapter 6, we have

       $\omega_n(x) = h^{n+1} \pi_n(t)$,

and the integrals $C_k$ in (6b) can be written as

       $C_k = h^{n+2} B_k$

where

       $B_k \equiv \int_{i+k-1}^{i+k} \pi_n(t)\,dt$,  $k = 2, 3, \ldots, r - 1$;    $B_1 \equiv \int_{t_a}^{i+1} \pi_n(t)\,dt$,    $B_r \equiv \int_{i+r-1}^{t_b} \pi_n(t)\,dt$.

The limits of integration $t_a$ and $t_b$ are given by

       $t_a = \frac{a - x_0}{h}$,    $t_b = \frac{b - x_0}{h}$,

and lie in the intervals

       $i \le t_a < i + 1$,    $i + r - 1 < t_b \le i + r$.

Since $\pi_n(t)$ is of one sign in these intervals, $B_1$ and $B_r$ can be bounded by

       $|B_1| \le \left| \int_{i}^{i+1} \pi_n(t)\,dt \right|$,    $|B_r| \le \left| \int_{i+r-1}^{i+r} \pi_n(t)\,dt \right|$,

and these bounds are independent of $h$. By using these results in (7), we
obtain the error representation

(8a)   $E_{n+1}\{f\} = \frac{h^{n+2}}{(n+1)!} \left[ L_o f^{(n+1)}(\xi_o) - L_e f^{(n+1)}(\xi_e) \right]$,

where $x_0 < \xi_e, \xi_o < x_n$ and

(8b)   $L_o \equiv B_1 + B_3 + \cdots$,    $L_e \equiv -B_2 - B_4 - \cdots$.

The constants $L_o$ and $L_e$ have the same sign and, in fact,

       $|L_o - L_e| = \left| \sum_{j=1}^{r} B_j \right| = \left| \int_{t_a}^{t_b} \pi_n(t)\,dt \right|$.

We have thus shown, in general, that if $f^{(n+1)}(x)$ is continuous in the
smallest interval, $[c, d]$, containing all the $x_j$, $a$ and $b$, then an interpolatory
quadrature formula which uses $n + 1$ equally spaced nodes, of spacing $h$,
has an error of the form (8). Furthermore, by (8) or (4b):

       $|E_{n+1}\{f\}| \le \frac{h^{n+2}}{(n+1)!} \int_{t_a}^{t_b} |\pi_n(t)|\,dt \, \max_{c \le \xi \le d} |f^{(n+1)}(\xi)|$.

This estimate is valid independent of the location of the nodes relative to
the interval of integration. We give the analogue of formula (5c) which is
valid in the special case that $(a, b)$ contains none of the uniformly spaced
points $\{x_j\}$:

(8c)   $E_{n+1}\{f\} = \frac{h^{n+2}}{(n+1)!} f^{(n+1)}(\xi) \int_{t_a}^{t_b} \pi_n(t)\,dt$,    $\xi \in (c, d)$.

In the next subsection, we treat the Newton-Cotes formulae, and we
find representations of $E_{n+1}\{f\}$ that are of the form (5c) or (8c), even though
$\omega_n(x)$ changes sign in $(a, b)$.


1.1. Newton-Cotes Formulae 

Let the interpolation points, $x_j$, be equally spaced, say as before,

(9a)   $x_j = x_0 + jh$,    $j = 0, 1, \ldots, n$;

but now let the endpoints of the interval of integration be placed such that

(9b)   $x_0 = a$,    $x_n = b$,    $h = \frac{b - a}{n}$.

With this choice of nodes the quadrature formula (3) as an approximation
to the integral (0.1) is called a closed Newton-Cotes formula. Note that
all of the nodes are in the integration interval $[a, b]$ and the word “closed”
means that the endpoints $a$ and $b$ are the extreme nodes of the formula (3).

To examine the error, we again introduce the change of variable
$x = x_0 + th$ and obtain $\omega_n(x) = h^{n+1} \pi_n(t)$. Now, however, $t$ ranges over
the interval $[0, n]$ and so we deduce properties of $\omega_n(x)$ over $[a, b]$ analogous
to those of $\pi_n(t)$ developed in Lemmas 3.1 and 3.2 of Chapter 6. With the
notation

       $x_{n/2} \equiv \frac{a + b}{2} = x_0 + \frac{n}{2} h$,

these properties are restated in
LEMMA 2.

       $\omega_n(x_{n/2} + \xi) = (-1)^{n+1} \omega_n(x_{n/2} - \xi)$. ■

LEMMA 3.

(a) For $a < \xi + h \le x_{n/2}$ and $\xi \ne x_j$, $j = 0, 1, \ldots, n$:

       $|\omega_n(\xi + h)| < |\omega_n(\xi)|$;

(b) For $x_{n/2} \le \xi < b$ and $\xi \ne x_j$, $j = 0, 1, \ldots, n$:

       $|\omega_n(\xi)| < |\omega_n(\xi + h)|$. ■

Let us introduce the functions

(10)   $\Omega_n(x) \equiv \int_a^x \omega_n(\xi)\,d\xi$,    $n = 1, 2, \ldots$,

which will be used to estimate the error in the closed Newton-Cotes
formulae. For these functions, we have

LEMMA 4. For $n$ even

(a) $\Omega_n(a) = \Omega_n(b) = 0$;

(b) $\Omega_n(x) > 0$,    $a < x < b$.

Proof. From the definition (10) it follows that $\Omega_n(a) = 0$. Since $n$ is
even, by Lemma 2, the integrand in $\Omega_n(b)$ is antisymmetric about the
midpoint of the interval of integration and hence $\Omega_n(b) = 0$.

For part (b) we observe that $a, x_1, x_2, \ldots, x_{n-1}, b$ are the only zeros
of $\omega_n(x)$, and hence $\omega_n(x) < 0$ for $x < a$ (since $\omega_n(x)$ is of odd degree).
Then $\omega_n(x) > 0$ for $a < x < x_1$ and thus

       $\Omega_n(x) > 0$ for $a < x \le x_1$.

But by Lemma 3, we see that the negative contribution of $\omega_n(x)$ over
$[x_1, x_2]$ to $\Omega_n(x)$ is in magnitude less than the positive contribution over
$[a, x_1]$. Therefore,

       $\Omega_n(x) > 0$ for $a < x \le x_2$.

This argument can be repeated to cover the interval $a < x \le x_{n/2}$. For
$x > x_{n/2}$, Lemma 2 is employed. ■

Notice that these arguments can be used to yield

LEMMA 5. For $n$ odd:

(a) $\Omega_n(a) = 0$,    $\Omega_n(b) = 2 \Omega_n(x_{n/2})$;

(b) $\Omega_n(x) < 0$,    $a < x \le b$. ■

However, we shall not require this lemma in our analysis of the error in 
quadrature formulae. 
We are now prepared to estimate the error, $E_{n+1}$, given by (4a), for the
closed Newton-Cotes formulae. We first treat the case of $n$ even and assume
that the integrand, $f(x)$, has $n + 2$ continuous derivatives. By using (10),
integration by parts [note that the continuity of $(d/dx) f[x_0, x_1, \ldots, x_n, x]$
is assured by Problem 1.6 of Chapter 6], and Lemma 4a, the error is

       $E_{n+1}\{f\} = \int_a^b \frac{d\Omega_n(x)}{dx} f[x_0, \ldots, x_n, x]\,dx$

       $= \Omega_n(x) f[x_0, \ldots, x_n, x] \Big|_a^b - \int_a^b \Omega_n(x) \frac{d}{dx} f[x_0, \ldots, x_n, x]\,dx$

       $= -\int_a^b \Omega_n(x) \frac{d}{dx} f[x_0, \ldots, x_n, x]\,dx$.

Hence,

       $E_{n+1}\{f\} = -\int_a^b \Omega_n(x) \frac{f^{(n+2)}(\xi(x))}{(n+2)!}\,dx$

(from Problem 1.7 and Corollary 2 to Theorem 1.2, all of Chapter 6).

Now

       $f^{(n+2)}(\xi(x)) = (n + 2)!\, \frac{d}{dx} f[x_0, \ldots, x_n, x]$

is continuous by Problem 1.6 of Chapter 6. By Lemma 4b, $\Omega_n(x) > 0$.
Hence, we may apply the mean value theorem for integrals in the above
to get

       $E_{n+1}\{f\} = -\frac{f^{(n+2)}(\eta)}{(n+2)!} \int_a^b \Omega_n(x)\,dx$,    $a < \eta < b$.

In addition, integration by parts and Lemma 4 yield

       $\int_a^b \Omega_n(x)\,dx = x \Omega_n(x) \Big|_a^b - \int_a^b x \frac{d\Omega_n(x)}{dx}\,dx = -\int_a^b x\, \omega_n(x)\,dx > 0$.

These results have established

THEOREM 1. Let the points of (9) divide $[a, b]$ into an even number of
equal intervals. Let $f(x)$ have a continuous derivative of order $n + 2$ on
$[a, b]$. Then the error, (4a), in the closed Newton-Cotes quadrature
formula, (3), for $n$ even is

       $E_{n+1}\{f\} = \frac{K_n}{(n+2)!} f^{(n+2)}(\eta)$,    $a < \eta < b$;

where

       $K_n \equiv \int_a^b x\, \omega_n(x)\,dx < 0$. ■

We deduce from this theorem the interesting result that the closed
Newton-Cotes formula with an even number, $n$, of intervals has degree of precision
$n + 1$ (even though the interpolation polynomial employed is of degree $n$).

To treat the case of odd $n$, we could employ Lemma 5. This would lead
to an error expression containing two terms involving different order
derivatives of $f(x)$. However, to obtain the simpler form of Theorem 1
we first recall that $\omega_n(x)$ does not change sign in $[b - h, b]$. Then (4a)
yields by the mean value theorem for integrals and (1.9) of Chapter 6,

       $E_{n+1}\{f\} = \int_a^{b-h} \omega_n(x) f[x_0, \ldots, x_n, x]\,dx + \int_{b-h}^{b} \omega_n(x) f[x_0, \ldots, x_n, x]\,dx$

       $= \int_a^{b-h} \omega_n(x) f[x_0, \ldots, x_n, x]\,dx + \frac{f^{(n+1)}(\xi')}{(n+1)!} \int_{b-h}^{b} \omega_n(x)\,dx$,    $a < \xi' < b$.

To treat the first integral, we write

       $\omega_n(x) = \omega_{n-1}(x)(x - x_n)$  and  $\Omega_{n-1}(x) = \int_a^x \omega_{n-1}(\xi)\,d\xi$.

Then the properties of divided differences given in (1.5) and (1.6) of Chapter
6 permit

       $\int_a^{b-h} \omega_n(x) f[x_0, \ldots, x_n, x]\,dx = \int_a^{b-h} \frac{d\Omega_{n-1}(x)}{dx} \left( f[x_0, \ldots, x_{n-1}, x] - f[x_0, \ldots, x_n] \right) dx$.

Now $n - 1$ is even, and so $\Omega_{n-1}(a) = \Omega_{n-1}(b - h) = 0$, or

       $\int_a^{b-h} \frac{d\Omega_{n-1}(x)}{dx}\,dx = 0$.

Hence we may neglect the integral involving the constant $f[x_0, \ldots, x_n]$.
For the remaining integral, an integration by parts and application of the
mean value theorem for integrals as before yield

       $\int_a^{b-h} \omega_n(x) f[x_0, \ldots, x_n, x]\,dx = -\frac{f^{(n+1)}(\xi'')}{(n+1)!} \int_a^{b-h} \Omega_{n-1}(x)\,dx$,    $a < \xi'' < b$.

Thus we have deduced that

       $E_{n+1}\{f\} = -\left[ A f^{(n+1)}(\xi') + B f^{(n+1)}(\xi'') \right]$,

where

       $A = -\frac{1}{(n+1)!} \int_{b-h}^{b} \omega_n(x)\,dx$,    $B = \frac{1}{(n+1)!} \int_a^{b-h} \Omega_{n-1}(x)\,dx$.

However, since $x = b$ is the largest zero of $\omega_n(x)$ and $\omega_n(x) > 0$ for
$x > b$, it follows that $\omega_n(x) < 0$ in $[b - h, b]$, and so $A > 0$. That
$B > 0$ follows from Lemma 4, since $n - 1$ is even. Thus, if $f^{(n+1)}(x)$ is
continuous on $[a, b]$ an application of Lemma 1 implies that there exists
a point $\xi$ in $[\xi', \xi'']$ such that

       $E_{n+1}\{f\} = -(A + B) f^{(n+1)}(\xi)$.

Since

       $\omega_n(x) = \frac{d\Omega_{n-1}(x)}{dx} (x - b)$,

we have through integration by parts and Lemma 4

       $\int_a^{b-h} \omega_n(x)\,dx = \Omega_{n-1}(x)(x - b) \Big|_a^{b-h} - \int_a^{b-h} \Omega_{n-1}(x)\,dx = -\int_a^{b-h} \Omega_{n-1}(x)\,dx$.

Thus

       $A + B = -\frac{1}{(n+1)!} \int_a^b \omega_n(x)\,dx$.

In summary, we have

THEOREM 2. If the points of (9) divide $[a, b]$ into an odd number of equal
intervals and $f(x)$ has a continuous derivative of order $n + 1$ on $[a, b]$
then the error, (4a), in the closed Newton-Cotes quadrature formula (3),
for $n$ odd is

       $E_{n+1}\{f\} = \frac{K_n}{(n+1)!} f^{(n+1)}(\xi)$,    $a < \xi < b$;

where

       $K_n \equiv \int_a^b \omega_n(x)\,dx < 0$. ■

The formula covered by this theorem has degree of precision $n$. Note that
the result in Theorem 2 is formally similar to that in (5c).

To express the dependence of the errors given in Theorems 1 and 2
on the interval size, $h$, we use the change of variable, $x = x_0 + th$, and
find

COROLLARY. Under the hypotheses of Theorems 1 and 2, respectively,

(11)   $E_{n+1}\{f\} = \frac{M_n}{(n+2)!} h^{n+3} f^{(n+2)}(\xi)$,    $M_n \equiv \int_0^n t\, \pi_n(t)\,dt < 0$,    $n$ even;

       $E_{n+1}\{f\} = \frac{M_n}{(n+1)!} h^{n+2} f^{(n+1)}(\xi)$,    $M_n \equiv \int_0^n \pi_n(t)\,dt < 0$,    $n$ odd. ■

Since the closed formulae are exact for polynomials of degree at most 
n + 1 when n + 1 is odd, and are exact for polynomials of degree at 
most n when n + 1 is even, it is generally preferable to employ the odd 
formulae, i.e., those with an odd number of interpolation points. Also, 
it clearly does not pay, in general, to add one point to a scheme with even
$n$; rather, points should be added in pairs.
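
As a check on the corollary, the two lowest cases can be worked by hand. The computation below is an added illustration, assuming $\pi_n(t) = t(t - 1) \cdots (t - n)$, which is what $\omega_n(x) = h^{n+1} \pi_n(t)$ gives for nodes $x_j = x_0 + jh$; it recovers the familiar error terms of the trapezoidal rule ($n = 1$) and Simpson’s rule ($n = 2$).

```latex
% n = 1 (odd), \pi_1(t) = t(t-1):
M_1 = \int_0^1 t(t-1)\,dt = -\tfrac{1}{6},
\qquad
E_2\{f\} = \frac{M_1}{2!}\,h^{3} f''(\xi) = -\frac{h^{3}}{12}\,f''(\xi).

% n = 2 (even), \pi_2(t) = t(t-1)(t-2):
M_2 = \int_0^2 t \cdot t(t-1)(t-2)\,dt
    = \tfrac{32}{5} - 12 + \tfrac{16}{3} = -\tfrac{4}{15},
\qquad
E_3\{f\} = \frac{M_2}{4!}\,h^{5} f^{(4)}(\xi) = -\frac{h^{5}}{90}\,f^{(4)}(\xi).
```

Adding the single midpoint node to the trapezoidal rule thus raises the degree of precision from 1 to 3, in line with the remark that the formulae with an odd number of points are generally preferable.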

Another useful integration formula with equal intervals is found by
using the points

(12a)  $x_j = x_0 + jh$,    $j = 0, 1, \ldots, n$;

where

(12b)  $h = \frac{b - a}{n + 2}$,    $x_0 = a + h$,    $x_n = b - h$.

The endpoints are then labeled $x_{-1} = a$ and $x_{n+1} = b$. Since we do not
employ the endpoints in formula (3), it is now called an open Newton-Cotes
formula. In this procedure, all $n + 1$ points of interpolation are interior
to the interval of integration.

To examine the error we introduce, in place of $\Omega_n(x)$, the functions
$J_n(x)$ defined by

(13)   $J_n(x) \equiv \int_a^x \omega_n(\xi)\,d\xi$,    $n = 1, 2, \ldots$.

These differ from the functions in (10) since now $a < x_0$ and $x_n < b$.
However, as in the proof of Lemma 4, it follows that for $n$ even

       $J_n(a) = J_n(b) = 0$;    $J_n(x) < 0$,    $a < x < b$.

Then, exactly as in the derivation of Theorem 1, we have

THEOREM 1'. Replace “(9)” by “(12)” and “closed” by “open” in the
statement of Theorem 1. Then the formula for $E_{n+1}\{f\}$ becomes

       $E_{n+1}\{f\} = \frac{K_n'}{(n+2)!} f^{(n+2)}(\xi)$,    $a < \xi < b$;  $n$ even,

where

       $K_n' \equiv \int_a^b x\, \omega_n(x)\,dx > 0$.

Similarly, for $n$ odd, by procedures analogous to those for closed-type
formulae, we find

THEOREM 2'. Use the hypothesis of Theorem 1', but with $n$ odd. Then

       $E_{n+1}\{f\} = \frac{K_n'}{(n+1)!} f^{(n+1)}(\xi)$,    $a < \xi < b$;  $n$ odd,

where

       $K_n' \equiv \int_a^b \omega_n(x)\,dx > 0$.

These errors for the open formula may be expressed in terms of the spacing
$h$ as

COROLLARY. Under the hypothesis of Theorems 1' and 2', respectively,

(14a)  $E_{n+1}\{f\} = \frac{M_n'}{(n+2)!} h^{n+3} f^{(n+2)}(\xi)$,    $M_n' \equiv \int_{-1}^{n+1} t\, \pi_n(t)\,dt > 0$,    $n$ even;

(14b)  $E_{n+1}\{f\} = \frac{M_n'}{(n+1)!} h^{n+2} f^{(n+1)}(\xi)$,    $M_n' \equiv \int_{-1}^{n+1} \pi_n(t)\,dt > 0$,    $n$ odd.

Again we find that for $n$ even, the degree of precision is $n + 1$, while for
$n$ odd it is only $n$. A comparison of (11) and (14) indicates that only the
values of the coefficients $M_n$ and $M_n'$ differ in the form of the error estimates
for open and closed Newton-Cotes formulae based on the same number,
$n + 1$, of nodes. [However, for any fixed number of intervals in $[a, b]$,
say $m$, the closed formulae use $m + 1$ nodes and the open formulae use
$m - 1$ nodes. Hence, the closed method has a degree of precision two
more than the open method on this basis, but requires two more evaluations
of the function $f(x)$.]
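
For comparison with the closed case, the constants in (14) can be evaluated for the two simplest open formulae covered by (13). This added computation again assumes $\pi_n(t) = t(t - 1) \cdots (t - n)$; it is a worked check, not part of the original derivation.

```latex
% n = 1 (odd), nodes a + h, a + 2h, with h = (b - a)/3:
M_1' = \int_{-1}^{2} t(t-1)\,dt = \tfrac{3}{2},
\qquad
E_2\{f\} = \frac{M_1'}{2!}\,h^{3} f''(\xi) = \frac{3h^{3}}{4}\,f''(\xi).

% n = 2 (even), nodes a + h, a + 2h, a + 3h, with h = (b - a)/4:
M_2' = \int_{-1}^{3} t \cdot t(t-1)(t-2)\,dt = \tfrac{112}{15},
\qquad
E_3\{f\} = \frac{M_2'}{4!}\,h^{5} f^{(4)}(\xi) = \frac{14h^{5}}{45}\,f^{(4)}(\xi).
```

The signs are positive, as (14) asserts, and the magnitudes are larger than the closed constants $1/12$ and $1/90$ for the same $n$.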

There are useful quadrature formulae that are neither open nor closed 
but which have uniformly spaced nodes [e.g., see Problems 2, 3, and 5]. 
The formulae of Problems 2 and 3 are the basis for Adams’ method for
the numerical solution of ordinary differential equations (see Table 2.1 
of Chapter 8). 
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1.2. Determination of the Coefficients 

Let the interpolation points be equally spaced, say of the form x 5 = 
x 0 + jh, j = 0, l, .. n, and let the endpoints a , b of the integration 
interval also be of this form, say a — x p = x 0 + ph and b = x q = x 0 + qh , 
(but not necessarily interpolation points). Then the coefficients, w n y , for 
quadrature formulae of the form (3) can be written as, using x ~ x 0 + th , 


wv* = 



dx 


r* » t - A: , 

=*I n TTTfc dt 

Jp fc=0 j K 
ik*j) 


j = 0, 1,..». 


Thus if we define the quantities 

(15a) A n-i (p, q) = - T^j dt ’ J = - «! 


the coefficients are simply 

(15b) w ntj = hA nJ (p, q). 


In the special case of the closed Newton-Cotes schemes we have p = 0 
and # = n; for the open schemes of Section 1.1, p = — 1 and q = n 4- 1. 
It should be noted that the A n j (p , </) are independent of the spacing, h 9 
and thus may be tabulated as functions of the parameters n , j\ p y and q. 
Appropriate coefficients for a particular spacing, h, are then determined 
by using (15b). Since the integrand in (15a) is a polynomial of degree n 
the A nJ (p,q) will all be rational numbers for rational p and q. Further, 
we note that A nt} (0 y n) — A ntn -,( 0, n) and more generally, A n<j { — r, n + r) 
~ /4n.n~/ —r, n + r) for any r. For p and q of these important special 
forms, we need only tabulate the values for j < n/2. Tables 1 and 2 list 
the simplest closed and open Newton-Cotes formulae. The coefficients 
A n f (n y n + 1) are to be found in Problem 6, for n = 1, 2, 3, and 4. 
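As a hedged illustration of (15a) and (15b), the quantities $A_{n,j}(p, q)$ can be tabulated in exact rational arithmetic by expanding the Lagrange product and integrating term by term. The sketch below is not taken from the text; the helper names `poly_mul` and `A` are hypothetical, and integer (or rational) p and q are assumed.

```python
from fractions import Fraction


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def A(n, j, p, q):
    """A_{n,j}(p, q): integral over [p, q] of prod_{k != j} (t - k)/(j - k)."""
    poly = [Fraction(1)]
    for k in range(n + 1):
        if k != j:
            # Multiply by the linear factor (t - k)/(j - k).
            poly = poly_mul(poly, [Fraction(-k, j - k), Fraction(1, j - k)])
    p, q = Fraction(p), Fraction(q)
    # Integrate term by term: c * t^m -> c * (q^(m+1) - p^(m+1)) / (m + 1).
    return sum(c * (q ** (m + 1) - p ** (m + 1)) / (m + 1)
               for m, c in enumerate(poly))


# Closed scheme with n = 2 (p = 0, q = n): multiply by h to get the weights.
closed_n2 = [A(2, j, 0, 2) for j in range(3)]
# Open scheme with n = 1 (p = -1, q = n + 1).
open_n1 = [A(1, j, -1, 2) for j in range(2)]
```

Multiplying `closed_n2` by h gives the coefficients of the three-point closed formula, and `open_n1` those of the two-point open formula; the symmetry $A_{n,j} = A_{n,n-j}$ noted above can be checked the same way.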

Table 1. Closed Newton-Cotes Formulae

$\int_{x_0}^{x_1} f(x)\,dx = \frac{h}{2}(f_0 + f_1) - \frac{h^3}{12} f^{(2)}(\xi), \quad x_0 < \xi < x_1$ (trapezoidal rule)

$\int_{x_0}^{x_2} f(x)\,dx = \frac{h}{3}(f_0 + 4f_1 + f_2) - \frac{h^5}{90} f^{(4)}(\xi), \quad x_0 < \xi < x_2$ (Simpson's rule)

$\int_{x_0}^{x_3} f(x)\,dx = \frac{3h}{8}(f_0 + 3f_1 + 3f_2 + f_3) - \frac{3h^5}{80} f^{(4)}(\xi), \quad x_0 < \xi < x_3$

$\int_{x_0}^{x_4} f(x)\,dx = \frac{2h}{45}(7f_0 + 32f_1 + 12f_2 + 32f_3 + 7f_4) - \frac{8h^7}{945} f^{(6)}(\xi), \quad x_0 < \xi < x_4$

$\int_{x_0}^{x_5} f(x)\,dx = \frac{5h}{288}(19f_0 + 75f_1 + 50f_2 + 50f_3 + 75f_4 + 19f_5) - \frac{275h^7}{12096} f^{(6)}(\xi), \quad x_0 < \xi < x_5$

Table 2. Open Newton-Cotes Formulae

$\int_{x_0}^{x_2} f(x)\,dx = 2h f_1 + \frac{h^3}{3} f^{(2)}(\xi), \quad x_0 < \xi < x_2$ (midpoint rule)

$\int_{x_0}^{x_3} f(x)\,dx = \frac{3h}{2}(f_1 + f_2) + \frac{3h^3}{4} f^{(2)}(\xi), \quad x_0 < \xi < x_3$

$\int_{x_0}^{x_4} f(x)\,dx = \frac{4h}{3}(2f_1 - f_2 + 2f_3) + \frac{14h^5}{45} f^{(4)}(\xi), \quad x_0 < \xi < x_4$

$\int_{x_0}^{x_5} f(x)\,dx = \frac{5h}{24}(11f_1 + f_2 + f_3 + 11f_4) + \frac{95h^5}{144} f^{(4)}(\xi), \quad x_0 < \xi < x_5$

$\int_{x_0}^{x_6} f(x)\,dx = \frac{3h}{10}(11f_1 - 14f_2 + 26f_3 - 14f_4 + 11f_5) + \frac{41h^7}{140} f^{(6)}(\xi), \quad x_0 < \xi < x_6$

$\int_{x_0}^{x_7} f(x)\,dx = \frac{7h}{1440}(611f_1 - 453f_2 + 562f_3 + 562f_4 - 453f_5 + 611f_6) + \frac{5257h^7}{8640} f^{(6)}(\xi), \quad x_0 < \xi < x_7$

An alternative indirect procedure, called the method of undetermined
coefficients, can also be employed to determine the coefficients for the
quadrature formula (3). This method is quite practical for unequally
spaced nodes as well as for fairly large values of n. In addition, it can be
used to prove


THEOREM 3. A quadrature formula which uses n + 1 distinct nodes is an
interpolatory formula iff it has degree of precision at least n.

Proof. The necessity has been demonstrated by equation (4b). To
prove sufficiency, we let the n + 1 distinct points $x_j$, $j = 0, 1, \ldots, n$, be
the given nodes. If the quadrature formula, with these points and coefficients
$\alpha_j$, has degree of precision at least n in approximating $\int_a^b f(x)\,dx$, then we
must have

(16)
$\sum_{j=0}^{n} \alpha_j = (b - a)$
$\sum_{j=0}^{n} \alpha_j x_j = \frac{1}{2}(b^2 - a^2)$
$\cdots$
$\sum_{j=0}^{n} \alpha_j x_j^n = \frac{1}{n+1}(b^{n+1} - a^{n+1}).$

This may be considered as a system of n + 1 equations for the determination
of the n + 1 coefficients, $\alpha_j$. Since the coefficient matrix of the system
(16) is of the Vandermonde form, and the $x_j$ are distinct, there exists a
unique solution. On the other hand, note that the interpolatory formula (3),
with the same points $x_j$, is exact when applied to the powers $1, x, \ldots, x^n$.
Hence, the system (16) is satisfied with the $\alpha_j$ replaced by the $w_{n,j}$. The
fact that (16) has a unique solution shows $\alpha_j = w_{n,j}$. ■

We shall see that most of the popular quadrature formulae are interpolatory.
This follows by means of Theorem 3 and its extension, in Section
4, to weighted quadrature formulae.

The method of undetermined coefficients consists in solving the system
(16) for the $\alpha_j$. To determine the degree of precision of an interpolatory
quadrature formula, we simply form the quantities

$E_{n+1}\{x^k\} = \frac{1}{k+1}(b^{k+1} - a^{k+1}) - \sum_{j=0}^{n} \alpha_j x_j^k, \qquad k = n + 1, n + 2, \ldots;$

and determine the least integer k such that $E_{n+1}\{x^k\} \ne 0$. The degree of
precision is then k - 1.
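The procedure just described can be sketched in a few lines. The following is a minimal illustration, assuming exact rational nodes and endpoints; the function names `undetermined_coefficients` and `degree_of_precision` and the cap `max_k` are hypothetical choices made for this example, not part of the text.

```python
from fractions import Fraction


def undetermined_coefficients(nodes, a, b):
    """Solve system (16): sum_j alpha_j x_j^v = (b^(v+1) - a^(v+1))/(v+1), v = 0..n."""
    xs = [Fraction(x) for x in nodes]
    a, b = Fraction(a), Fraction(b)
    m = len(xs)
    # Augmented Vandermonde matrix, one row per power v.
    rows = [[x ** v for x in xs] + [(b ** (v + 1) - a ** (v + 1)) / (v + 1)]
            for v in range(m)]
    # Gauss-Jordan elimination in exact rational arithmetic.
    for col in range(m):
        pivot = next(r for r in range(col, m) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [entry / rows[col][col] for entry in rows[col]]
        for r in range(m):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [er - factor * ec for er, ec in zip(rows[r], rows[col])]
    return [rows[r][m] for r in range(m)]


def degree_of_precision(nodes, alphas, a, b, max_k=50):
    """Return k - 1 for the least k with E{x^k} != 0 (search capped at max_k)."""
    xs = [Fraction(x) for x in nodes]
    a, b = Fraction(a), Fraction(b)
    for k in range(max_k):
        exact = (b ** (k + 1) - a ** (k + 1)) / (k + 1)
        approx = sum(al * x ** k for al, x in zip(alphas, xs))
        if exact != approx:
            return k - 1
    return max_k  # no failure found below the (arbitrary) cap


simpson = undetermined_coefficients([0, 1, 2], 0, 2)
simpson_degree = degree_of_precision([0, 1, 2], simpson, 0, 2)
```

With three equally spaced nodes on [0, 2] this reproduces the coefficients h/3, 4h/3, h/3 with h = 1, and the test powers show the degree of precision to be 3 rather than 2, as expected for an even number of intervals.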

If it is known that the error in the integration formula has the form
$E_{n+1}\{f\} = A_m f^{(m)}(\xi)$ for some integer m, then we can determine the
coefficient $A_m$ by this method. That is, we must have m = k where k - 1
is the determined degree of precision. Hence, use $f(x) = x^k$ to get

(17) $E_{n+1}\{x^k\} = \frac{1}{k+1}(b^{k+1} - a^{k+1}) - \sum_{j=0}^{n} \alpha_j x_j^k = k!\,A_k,$

which can be solved for $A_k$. For example, in (11) for the closed Newton-Cotes
formulae, we may use the quantities $A_{n+2}$ or $A_{n+1}$ to evaluate
the appropriate coefficient $M_n$.

As an illustration of the application of the method of undetermined
coefficients, we consider the closed formula with one segment, n = 1,
and the two nodes $x_0 = a$ and $x_1 = b$. The system (16) now becomes

$\alpha_0 + \alpha_1 = (b - a);$
$a\alpha_0 + b\alpha_1 = \frac{1}{2}(b^2 - a^2);$

and the solution of this system is

$\alpha_0 = \alpha_1 = \frac{1}{2}(b - a).$

To determine the degree of precision we apply the formula to $x^2, x^3, \ldots$
and get first

$E_2\{x^2\} = \frac{1}{3}(b^3 - a^3) - \frac{1}{2}(b - a)(a^2 + b^2) = -\frac{1}{6}(b - a)^3 \ne 0.$

Thus the degree of precision is 1, as we knew since it is a closed Newton-Cotes
formula with n = 1. The error can be written as

$E_2\{f\} = A_2 f^{(2)}(\xi)$

where $A_2 = (M_1/2!)\,h^3$ and $h = b - a$. By using $f(x) = x^2$ we find that

$E_2\{x^2\} = -\frac{1}{6}(b - a)^3 = 2A_2.$

Thus $A_2 = -\frac{1}{12}(b - a)^3$ and $M_1 = -\frac{1}{6}$. The formula determined above
is the familiar trapezoidal rule which can now be written as

(18) $\int_a^b f(x)\,dx = \frac{b - a}{2}[f(a) + f(b)] - \frac{(b - a)^3}{12} f^{(2)}(\xi), \qquad a < \xi < b.$

PROBLEMS, SECTION 1 

1. Add to the hypothesis of Lemma 1: r of the coefficients $\{\alpha_j\}$ are non-zero.
What is the smallest integer r for which the stronger conclusion $\xi \in (a, b)$
is valid?

2. From equation (4.19) of Chapter 6, derive the formula

$\int_x^{x+h} f(t)\,dt = h[d_0 f(x) + d_1 \Delta f + d_2 \Delta^2 f + \cdots + d_m \Delta^m f] + h^{m+2} d_{m+1} f^{(m+1)}(\xi),$

where $d_0 = 1$, and recursively

$d_k - \frac{d_{k-1}}{2} + \frac{d_{k-2}}{3} - \cdots + (-1)^k \frac{d_0}{k+1} = 0, \qquad k = 1, 2, \ldots, m + 1;$

$x < \xi < x + mh; \qquad \Delta^k f = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} f(x + jh).$

[Hint: The coefficients $\{d_i\}$ satisfy the identity in $\Delta$,

$\Delta = \log(I + \Delta)\,(d_0 I + d_1 \Delta + \cdots + d_m \Delta^m + \cdots).$]

Check the following listing:

k:     0     1      2       3      4         5
d_k:   1     1/2    -1/12   1/24   -19/720   3/160

3. From equation (4.19) of Chapter 6, derive the formula

$\int_{x-h}^{x} f(t)\,dt = h[e_0 f(x) + e_1 \Delta f + e_2 \Delta^2 f + \cdots + e_m \Delta^m f] + h^{m+2} e_{m+1} f^{(m+1)}(\eta),$

where $e_0 = 1$, and recursively,

$e_k - \frac{e_{k-1}}{2} + \frac{e_{k-2}}{3} - \cdots + (-1)^k \frac{e_0}{k+1} = (-1)^k, \qquad k = 1, 2, \ldots, m + 1;$

$x - h < \eta < x + mh$; $\Delta$ as defined in Problem 2.

[Hint: The coefficients $\{e_i\}$ satisfy the identity

$\Delta - \Delta^2 + \Delta^3 - \cdots + (-1)^m \Delta^{m+1} + \cdots = \left(\Delta - \frac{\Delta^2}{2} + \frac{\Delta^3}{3} - \cdots\right)(e_0 I + e_1 \Delta + e_2 \Delta^2 + \cdots).$]

Check the following listing:

k:     0     1      2      3      4         5
e_k:   1     -1/2   5/12   -3/8   251/720   -95/288

4. With $\{d_i\}$ and $\{e_i\}$ as defined in Problems 2 and 3, and $e_{-1} = 0$, show that

$d_m = e_m + e_{m-1}.$

5. From Everett's form of the interpolation polynomial, (3.17) of Chapter 6,
define the coefficients $\{r_i\}$ and $C_m$ of the formula

$\frac{1}{h}\int_{x_0}^{x_1} f(x)\,dx = r_0(f_0 + f_1) + r_1(\delta^2 f_0 + \delta^2 f_1) + \cdots + r_m(\delta^{2m} f_0 + \delta^{2m} f_1) + C_m h^{2m+2} f^{(2m+2)}(\xi),$

where

$x_{-m} < \xi < x_{m+1}, \qquad x_j = x_0 + jh.$

Explicitly find $r_0$ and $r_1$.

6. Make a table listing the coefficients $A_{n,j}(n, n + 1)$, for n = 1, 2, 3,
and 4. [Hint: Use the result of Problem 3.]

7. Prove Lemma 5. [Hint: Let n = 2m + 1 and $0 < \epsilon < \frac{1}{2}$. Show that
$|\pi_n(t + 1)/\pi_n(t)| < 1$ for $t = m - \epsilon$.]
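The listings asked for in Problems 2 and 3 can be checked mechanically. The sketch below assumes the recursions in the alternating-sign form obtained from the series for $\log(I + \Delta)$ in the hints, that is $\sum_{i=0}^{k} (-1)^i c_{k-i}/(i+1)$ equal to 0 for the $d_k$ and to $(-1)^k$ for the $e_k$; the function name `difference_coefficients` is hypothetical.

```python
from fractions import Fraction


def difference_coefficients(count, rhs):
    """Solve c_k - c_{k-1}/2 + ... + (-1)^k c_0/(k+1) = rhs(k), k >= 1, with c_0 = 1."""
    c = [Fraction(1)]
    for k in range(1, count):
        # All terms of the recursion except the leading c_k.
        tail = sum(Fraction((-1) ** i, i + 1) * c[k - i] for i in range(1, k + 1))
        c.append(rhs(k) - tail)
    return c


d = difference_coefficients(6, lambda k: Fraction(0))          # Problem 2
e = difference_coefficients(6, lambda k: Fraction((-1) ** k))  # Problem 3
```

The same lists also give a quick numerical check of the relation in Problem 4, since `d[m] == e[m] + e[m - 1]` can be tested for each m.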


2. ROUNDOFF ERRORS AND UNIFORM COEFFICIENT FORMULAE 

In almost all practical applications of integration formulae of the form
(0.2) the exact function values, $f(x_j)$, will not be available. This fact is
usually due to limitations in the calculation of these function values (or
in their measurement). Thus the quantities actually employed may be
written as $\bar{f}(x_j) = f(x_j) + \rho_j$ where $\rho_j$ is the local roundoff error made in
computing (or error in measuring) $f(x_j)$. The error in approximating (0.1)
by (0.2) with these values is then

$\int_a^b f(x)\,dx - \sum_{j=1}^{n} \alpha_j \bar{f}(x_j) = E_n\{f\} - R_n\{f\},$

where $E_n\{f\} = I\{f\} - I_n\{f\}$ is the quadrature or truncation error and the
accumulated roundoff error is:

(1) $R_n\{f\} = \sum_{j=1}^{n} \alpha_j \rho_j.$

If, as is frequently the case, we know that the local errors are bounded,
say $|\rho_j| \le \rho$ for $j = 1, 2, \ldots, n$, then

(2) $|R_n\{f\}| \le \rho \sum_{j=1}^{n} |\alpha_j|.$

Let us also assume that the formula (0.2) has degree of precision $\ge 0$, i.e.,
that at least a constant is integrated correctly. Then with the integrand
f(x) = 1 we find

(3) $\sum_{j=1}^{n} \alpha_j = (b - a).$

Thus if all the coefficients, $\alpha_j$, are of one sign the bound (2) becomes

(4) $|R_n\{f\}| \le \rho\,|b - a|.$

If the coefficients are not all of one sign, then clearly,

$\sum_{j=1}^{n} |\alpha_j| > \left|\sum_{j=1}^{n} \alpha_j\right| = |b - a|$

and a larger maximum accumulated roundoff is possible. To attain the
maximum value requires that

$\rho_j = \rho\,\mathrm{sign}\,\alpha_j$

which is, of course, very special. By comparing (4) and (2), we find a
practical advantage in having the coefficients of a quadrature formula all
of one sign, especially if the truncation error is smaller than the rounding
error.

By introducing statistical notions of roundoff (or measurement) errors,
we can, in fact, show that it is of even greater advantage to have all of the
coefficients of the same value. There are several ways in which the statistical
notion of "randomness" of the local errors can be introduced. For
instance, suppose we ask for those coefficients, $\alpha_j$, for which some measure
of $R_n\{f\}$ is a minimum for all functions f of some particular class, F.
Since the errors $\rho_j = \rho_j\{f\}$ usually depend in an extremely complicated
way on the functions f, direct attempts at such a minimization do not seem
possible. However, as f varies over the class F the errors $\rho_j\{f\}$ can be
expected to vary in an erratic manner. By making specific assumptions
about the nature of this variation and introducing a measure of "volume"
in F we can calculate various "averages" of $R_n\{f\}$ over F.

Specifically, let us consider for F a one parameter family of functions of
x, say $f(x; \tau)$, where the parameter $\tau$ ranges over $0 \le \tau \le T$. The roundoff
error in evaluating $f(x_k; \tau)$ for each $\tau$ and $k = 1, 2, \ldots, n$ will be denoted
by $\rho_k(\tau)$. We assume that $|\rho_k(\tau)| \le \rho$ and that all values in this range are
equally likely for each value of $\tau$; or in particular, we assume that the
"average" roundoff over the family F vanishes; i.e., that

(5a) $\bar{\rho}_k \equiv \frac{1}{T}\int_0^T \rho_k(\tau)\,d\tau = 0, \qquad k = 1, 2, \ldots, n.$

Further, we assume that the roundoff errors at distinct points $x_j$ and $x_k$
are uncorrelated; i.e.,

(5b) $\int_0^T \rho_j(\tau)\rho_k(\tau)\,d\tau = 0, \qquad \text{if } j \ne k.$

In effect, this means that the error committed at $x_j$ is independent of the
error at $x_k$ for all the functions, $f(x; \tau)$, in F. Finally, let us make the
assumption that all the local errors have the same mean-square value, say,

(5c) $\sigma^2 = \frac{1}{T}\int_0^T \rho_j^2(\tau)\,d\tau, \qquad j = 1, 2, \ldots, n.$

Note that $\sigma \le \rho$.

We now consider some measures of the accumulated roundoff error
for the family F. We define for any $\tau$ in $0 \le \tau \le T$:

$R_n(\tau) \equiv R_n\{f(x; \tau)\} = \sum_{j=1}^{n} \alpha_j \rho_j(\tau).$

The mean accumulated roundoff for the family F is, by (5a),

$\bar{R}_n \equiv \frac{1}{T}\int_0^T R_n(\tau)\,d\tau = \sum_{j=1}^{n} \alpha_j \bar{\rho}_j = 0.$

Thus the coefficients, $\alpha_j$, of the quadrature formula have no effect on the
average accumulated roundoff. Next we compute the mean-square roundoff
error, using (5b and c), to get

(6) $r_n^2 \equiv \frac{1}{T}\int_0^T R_n^2(\tau)\,d\tau = \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j \frac{1}{T}\int_0^T \rho_i(\tau)\rho_j(\tau)\,d\tau = \sigma^2 \sum_{j=1}^{n} \alpha_j^2.$

The coefficients clearly have an effect on this measure of the roundoff
error. Let us seek $\{\alpha_j\}$ which minimize the sum in the last line of (6) but
which also satisfy (3). This problem is easily solved by using the method
of Lagrange multipliers and we find for the minimizing coefficients:

(7) $\alpha_1 = \alpha_2 = \cdots = \alpha_n = \frac{b - a}{n}.$

Thus to reduce the root mean-square roundoff error, $r_n$, as much as possible,
while retaining at least zero order precision, the coefficients should be
equal. Using (7) in (6) yields for the minimum of $r_n$ the value

(8) $r_n = \frac{\sigma\,|b - a|}{\sqrt{n}}.$

This result is somewhat surprising as it indicates that the root mean-square
roundoff error actually decreases as the number of quadrature
points (and hence of sources of error) increases! It should be noted that
when the weights are equal the bound (4) applies. Compare the maximum
bound, $\rho\,|b - a|$, with the statistical result in (8) to find a reduction in the
dependence on n by the factor $1/\sqrt{n}$ for the statistical estimate. This is a
common feature of statistical estimates of roundoff. It is frequently found
in practice that the statistical estimate is a more realistic approximation
of the error than is the maximum-type estimate.
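A small numerical experiment can illustrate (6) through (8). The sketch below makes assumptions that go beyond the text: the local errors are drawn independently and uniformly from $[-\rho, \rho]$ (so that $\sigma = \rho/\sqrt{3}$), a seeded pseudo-random generator stands in for the family F, and the function names are hypothetical. It compares equal weights with composite Simpson weights on [0, 1] using nine nodes.

```python
import math
import random


def predicted_rms(alphas, sigma):
    """Root mean-square accumulated roundoff from (6): sigma * sqrt(sum alpha_j^2)."""
    return sigma * math.sqrt(sum(a * a for a in alphas))


def simulated_rms(alphas, rho, trials=20000, seed=0):
    """Estimate r_n by sampling independent local errors uniform in [-rho, rho]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        r = sum(a * rng.uniform(-rho, rho) for a in alphas)
        total += r * r
    return math.sqrt(total / trials)


rho = 1e-6
sigma = rho / math.sqrt(3.0)  # rms of a uniform error on [-rho, rho]
equal = [1.0 / 9.0] * 9       # uniform coefficients (7) on [0, 1], n = 9
simpson = [w / 24.0 for w in (1, 4, 2, 4, 2, 4, 2, 4, 1)]  # composite Simpson, h = 1/8

rms_equal = predicted_rms(equal, sigma)      # equals sigma / sqrt(9), as in (8)
rms_simpson = predicted_rms(simpson, sigma)  # larger, since the weights are unequal
rms_equal_sim = simulated_rms(equal, rho)    # should be close to rms_equal
```

Both weight sets sum to b - a = 1, so the maximum bound (4) is the same for each; only the root mean-square estimate distinguishes them.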

These results can be interpreted in a slightly different, perhaps more
familiar and intuitive way. We think of a family of calculations of the
quadrature formula applied to the same function, f(x). Each calculation
of the family is determined by the particular rounding employed in the set
of values $f(x_j)$, $j = 1, 2, \ldots, n$. That is, $\rho_j(\tau)$ is the roundoff error in the
computation characterized by the parameter value $\tau$ when computing $f(x_j)$.
Of course, any fixed program for an electronic computer is represented
by a single value of the parameter. Thus, the intended interpretation
is not repeating the same calculation, but rather, altering the computational
procedure slightly each time. The above averages are then averages over an
appropriate family of calculations. With this interpretation numerical
experiments using "randomly" generated and independent rounding
errors are easily devised.

We shall find that equal coefficient formulae of the form (0.2) cannot
approximate integrals of the form (0.1), with degree of precision n for all
values of n (see, however, Subsection 4.1).

2.1. Determination of Uniform Coefficient Formulae 

An integration formula with uniform coefficients and n nodes has the
simple form

(9a) $I_n\{f\} = \alpha_n \sum_{j=1}^{n} f(x_j).$

In order that this yield the exact result for

$I\{f\} = \int_a^b f(x)\,dx$

when f(x) = 1, it is clear that

(9b) $\alpha_n = \frac{b - a}{n}.$

We now try to determine the n nodes, $x_j$, such that (9) has as high a degree
of precision as possible. Thus we impose the n conditions, $I_n\{x^\nu\} = I\{x^\nu\}$
for $\nu = 1, 2, \ldots, n$ to get

(10) $\alpha_n \sum_{j=1}^{n} x_j^\nu = \frac{1}{\nu + 1}(b^{\nu+1} - a^{\nu+1}), \qquad \nu = 1, 2, \ldots, n.$

If these equations have as solution n distinct real values, $x_j$, then the corresponding
quadrature formula (9) has degree of precision at least n, while
only n nodes are employed (as in the Newton-Cotes formulae for an even
number of uniform intervals or odd number of nodes). Thus, by Theorem
1.3, we can conclude that such equal coefficient formulae are interpolatory.
The error estimates of Section 1 are then applicable [say, of the form
(1.4)].

Let us set n = 1 in (9) and (10). We find $\alpha_1 = b - a$, $x_1 = \frac{1}{2}(b + a)$,
and the quadrature formula is simply the midpoint rule,

$I_1\{f\} = (b - a)\,f\!\left(\frac{b + a}{2}\right),$

which is clearly exact for linear functions.

For n = 2, the system (10) is easily reduced to a quadratic equation
which yields the nodes

$x_1 = \frac{b + a}{2} - \frac{b - a}{2\sqrt{3}}, \qquad x_2 = \frac{b + a}{2} + \frac{b - a}{2\sqrt{3}}.$

Note that in each of these cases the nodes are symmetrically located
about the center of the interval of integration.

In general, the solution of the system (10) determines the nth degree
polynomial

$P_n(x) = (x - x_1)(x - x_2) \cdots (x - x_n)$

whose roots are the required nodes. This polynomial can be written in
the form

(11) $P_n(x) = x^n + \sigma_1 x^{n-1} + \sigma_2 x^{n-2} + \cdots + \sigma_n,$

where the coefficients are the classical elementary symmetric functions of
the roots, i.e.,

$\sigma_1 = -(x_1 + x_2 + \cdots + x_n)$
$\sigma_2 = (x_1 x_2 + x_1 x_3 + \cdots + x_{n-1} x_n)$
$\cdots$
$\sigma_n = (-1)^n x_1 x_2 \cdots x_n.$

However, the values of these symmetric functions can be obtained from
the sums of the powers of the roots

$S_\nu = x_1^\nu + x_2^\nu + \cdots + x_n^\nu, \qquad \nu = 1, 2, \ldots, n,$

which are directly determined in (10). The relations between the $\sigma_j$ and
the $S_\nu$ are known as Newton's identities (see Problem 2),

(12)
$S_1 + \sigma_1 = 0$
$S_2 + S_1 \sigma_1 + 2\sigma_2 = 0$
$\cdots$
$S_n + S_{n-1}\sigma_1 + S_{n-2}\sigma_2 + \cdots + S_1 \sigma_{n-1} + n\sigma_n = 0.$

Thus the determination of the nodes has been reduced to finding the roots
of the polynomial (11). The coefficients of this polynomial are recursively
computed from (12) by using the known values of the $S_\nu$.

The nodes for any interval [a, b] are easily obtained from the nodes for
the special interval [-1, 1]. For this purpose we introduce the usual
linear change of variable

(13) $x = \frac{b + a}{2} + y\,\frac{b - a}{2}$

and then note that

$I\{f\} = \int_a^b f(x)\,dx = \frac{b - a}{2}\int_{-1}^{1} g(y)\,dy = \frac{b - a}{2}\,J\{g(y)\},$

where

$g(y) = f\!\left(\frac{b + a}{2} + y\,\frac{b - a}{2}\right).$

The n-point uniform coefficient quadrature formula which approximates
$J\{g\}$ is written as

$J_n\{g\} = \beta_n \sum_{j=1}^{n} g(y_j).$

In order that this formula have degree of precision at least n, we must
have $J\{y^\nu\} = J_n\{y^\nu\}$ for $\nu = 0, 1, \ldots, n$. Thus $\beta_n = 2/n$ and

(14) $S_\nu = \sum_{j=1}^{n} y_j^\nu = \frac{n}{2(\nu + 1)}\,[1 + (-1)^\nu], \qquad \nu = 1, 2, \ldots, n.$

We note that the odd order power sums now vanish, i.e., $S_1 = S_3 = \cdots = 0$.
Newton's identities (12) become in this case

(15)
$\sigma_1 = 0$
$\frac{n}{3} + 2\sigma_2 = 0$
$\sigma_3 = 0$
$\frac{n}{5} + \frac{n}{3}\sigma_2 + 4\sigma_4 = 0$
$\sigma_5 = 0$
$\cdots$

Thus we find that the odd order elementary symmetric functions vanish
and the polynomial for the determination of the nodes, $y_j$, becomes

(16) $P_n(y) = y^n + \sigma_2 y^{n-2} + \sigma_4 y^{n-4} + \cdots.$


The roots, $y_j$, of $P_n(y) = 0$ are thus symmetric with respect to the
origin and if n is odd, then y = 0 is a root. Using the transformation (13)
we obtain for the nodes of the general quadrature formula (9a) the values

$x_j = \frac{b + a}{2} + y_j\,\frac{b - a}{2}.$

From the properties of the $y_j$ it follows that the nodes $x_j$ are symmetrically
located with respect to the midpoint of the interval of integration and that
for n odd the midpoint is a node. If n is an even integer then by the symmetry
of the $y_j$ we have

$\sum_{j=1}^{n} y_j^{n+1} = 0.$

That is, $J_n\{y^{n+1}\} = J\{y^{n+1}\}$ and the degree of precision is n + 1, when an
even number, n, of nodes is employed. The same must, of course, be true
in the general case (9) for n even.

In order to determine an n-point quadrature formula of the form (9)
which has degree of precision at least n, the polynomial (16) must have n
real distinct roots. However, for n = 8 and for all $n \ge 10$ it can be
shown that a pair of complex roots occurs. For $n \le 7$ and n = 9 the roots
have the required properties and are known to many decimals. We list in
Table 1 these roots in $0 \le y < 1$; the others are obtained by symmetry.


Table 1.

n = 1:  0.0
n = 2:  0.57735 02692
n = 3:  0.0,  0.70710 67812
n = 4:  0.18759 24741,  0.79465 44723
n = 5:  0.0,  0.37454 14096,  0.83249 74870
n = 6:  0.26663 54015,  0.42251 86538,  0.86624 68181
n = 7:  0.0,  0.32391 18105,  0.52965 67753,  0.88386 17008
n = 9:  0.0,  0.16790 61842,  0.52876 17831,  0.60101 86554,  0.91158 93077

Quadrature formulae with uniform coefficients and any number of 
nodes arise in Subsection 4.1, in another setting. 
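The construction of this subsection (power sums, Newton's identities, then the roots of the node polynomial) can be sketched as follows. This is only an illustration under stated assumptions: the power sums are taken as $S_\nu = n/(\nu + 1)$ for even $\nu$ and 0 for odd $\nu$ on [-1, 1], the real roots are located by a simple grid scan with bisection rather than by any method from the text, and all function names are hypothetical.

```python
from fractions import Fraction


def uniform_node_polynomial(n):
    """Coefficients [1, sigma_1, ..., sigma_n] of the node polynomial on [-1, 1]."""
    # Power sums on [-1, 1]: n/(v+1) for even v, zero for odd v.
    S = {v: (Fraction(n, v + 1) if v % 2 == 0 else Fraction(0))
         for v in range(1, n + 1)}
    sigma = [Fraction(1)]
    for k in range(1, n + 1):
        # Newton's identity: S_k + S_{k-1} sigma_1 + ... + S_1 sigma_{k-1} + k sigma_k = 0.
        acc = S[k] + sum(S[k - i] * sigma[i] for i in range(1, k))
        sigma.append(-acc / k)
    return sigma


def real_roots_in(coeffs, lo=-1.0, hi=1.0, steps=4000):
    """Real roots in [lo, hi] found by scanning a grid for sign changes, then bisecting."""
    c = [float(x) for x in coeffs]

    def p(y):
        acc = 0.0
        for ck in c:  # Horner evaluation, highest power first
            acc = acc * y + ck
        return acc

    roots = []
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    for left, right in zip(grid, grid[1:]):
        fl, fr = p(left), p(right)
        if fl == 0.0:
            roots.append(left)
        elif fl * fr < 0.0:
            for _ in range(80):
                mid = 0.5 * (left + right)
                if (p(mid) < 0.0) == (fl < 0.0):
                    left = mid
                else:
                    right = mid
            roots.append(0.5 * (left + right))
    if p(hi) == 0.0:
        roots.append(hi)
    return roots


def map_nodes(ys, a, b):
    """Apply the change of variable x = (b + a)/2 + y (b - a)/2."""
    return [0.5 * (b + a) + y * 0.5 * (b - a) for y in ys]


nodes4 = real_roots_in(uniform_node_polynomial(4))
```

For n = 4 the scan returns four real nodes matching the tabulated values, while for n = 8 the constant term of the polynomial in $y^2$ comes out negative, so fewer than eight real roots are found, in line with the statement above about complex roots.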


PROBLEMS, SECTION 2 


1. Verify that (7) characterizes the solution to the problem of minimizing
$\sum_{j=1}^{n} \alpha_j^2$ subject to $\sum_{j=1}^{n} \alpha_j = b - a$.

2. Verify Newton's identities (12).

[Hint: Let $P_n(x) = \prod_{j=1}^{n} (x - x_j) = \sum_{k=0}^{n} \sigma_k x^{n-k}$, with $\sigma_0 = 1$. Then

$P_n'(x) = \sum_{j=1}^{n} \prod_{i \ne j} (x - x_i) = \sum_{j=1}^{n} \frac{P_n(x)}{x - x_j} = \sum_{j=1}^{n} \frac{P_n(x) - P_n(x_j)}{x - x_j}.$

But,

$\frac{P_n(x) - P_n(x_j)}{x - x_j} = \sum_{k=0}^{n-1} \sigma_k \frac{x^{n-k} - x_j^{n-k}}{x - x_j}$

$= \sum_{k=0}^{n-1} \sigma_k \left(x^{n-k-1} + x_j x^{n-k-2} + \cdots + x_j^{n-k-2} x + x_j^{n-k-1}\right)$

$= x^{n-1} + \sum_{p=1}^{n-1} \left(\sigma_p + \sigma_{p-1} x_j + \cdots + \sigma_1 x_j^{p-1} + x_j^p\right) x^{n-p-1}.$

Hence

$P_n'(x) = n x^{n-1} + \sum_{p=1}^{n-1} \left(n\sigma_p + S_1 \sigma_{p-1} + \cdots + S_{p-1}\sigma_1 + S_p\right) x^{n-p-1}.$]


3. GAUSSIAN QUADRATURE; MAXIMUM DEGREE OF PRECISION 


In Section 1 it is shown that, given n + 1 fixed nodes, we can determine
the coefficients of a quadrature formula which has degree of precision at
least n (by forming the appropriate interpolatory formula). In Subsection
2.1 we investigated the problem of determining n nodes such that all the
coefficients are equal and the degree of precision is at least n. Now we
allow a choice of both n nodes and n coefficients in order to determine
formulae with maximum degree of precision. Of course, the degree of
precision for such a formula will not be less than the corresponding degree
for the interpolatory formula using the same nodes. Hence by Theorem
1.3 we conclude that the quadrature formula with maximum degree of
precision is interpolatory.

If the formula is to have the n nodes $x_1, x_2, \ldots, x_n$, it can be written as

(1) $\int_a^b f(x)\,dx = \sum_{j=1}^{n} \alpha_j f(x_j) + E_n\{f\}.$

However, since it must be interpolatory, the error can be written as,
recalling (1.4a),

(2) $E_n\{f\} = \int_a^b p_n(x)\,f[x_1, \ldots, x_n, x]\,dx,$

where we have introduced

(3) $p_n(x) \equiv (x - x_1)(x - x_2) \cdots (x - x_n).$

Clearly, $E_n\{f\} = 0$ if f(x) is a polynomial of degree n - 1 or less. We
seek points $x_j$ such that the error also vanishes when f(x) is any polynomial
of degree n + r where $r = 0, 1, \ldots, m$ and m is to be as large as
possible.

To determine such nodes we first recall that the nth divided difference
of any polynomial of degree n + r is a polynomial of degree at most r.
(This follows by repeated application, n times, of the result in Problem
1.2 of Chapter 6.) Thus from (2) we conclude that the necessary and
sufficient conditions for $E_n\{f\}$ to vanish for all polynomials, f, of degree
n + m are that

(4) $\int_a^b p_n(x)\,x^r\,dx = 0, \qquad r = 0, 1, \ldots, m.$

However, these are just the conditions that the polynomial $p_n(x)$ be orthogonal,
over [a, b], to all polynomials of degree at most m. In fact, if we
take for $p_n(x)$ the nth orthogonal polynomial, then (4) is satisfied for
m = n - 1. Further, (4) cannot be satisfied for m = n or else $p_n(x)$ would
have to vanish identically which is impossible. These results may be
summarized as


THEOREM 1. The quadrature formula in (1) can have the maximum degree
of precision 2n - 1. This is attained iff the n nodes, $x_j$, are the zeros of
$p_n(x)$, the nth orthogonal polynomial over [a, b], and the formula is
interpolatory. ■

The formulae determined by Theorem 1 are called Gaussian quadrature
formulae. From Theorem 3.4 of Chapter 5 it follows that the nodes are all
interior to the interval of integration, (a, b). The coefficients in these
formulae are easily obtained (once the nodes are determined) since they
are interpolatory; we get as in (1.3b)

(5) $\alpha_j = \frac{1}{p_n'(x_j)} \int_a^b \frac{p_n(x)}{x - x_j}\,dx, \qquad j = 1, 2, \ldots, n.$


Although it is not apparent from this expression, we have

THEOREM 2. The coefficients, $\alpha_j$, in the Gaussian quadrature formulae are
positive for all $j = 1, 2, \ldots, n$ and all n.

Proof. Since the Gaussian quadrature formula with n nodes has
degree of precision 2n - 1, it yields the exact value for $\int_a^b f(x)\,dx$ when
f(x) is any polynomial of degree 2n - 1 or less. In particular, then, it is
exact for

$q_j(x) \equiv \frac{p_n^2(x)}{(x - x_j)^2}, \qquad j = 1, 2, \ldots, n,$

which are polynomials of degree 2n - 2; i.e.,

$\int_a^b q_j(x)\,dx = \sum_{k=1}^{n} \alpha_k q_j(x_k).$

However, it is clear that

$q_j(x_k) = 0, \quad \text{for } k \ne j;$

$q_j(x_j) = \prod_{i=1,\,i \ne j}^{n} (x_j - x_i)^2 = [p_n'(x_j)]^2 > 0.$

Thus we find

$\alpha_j = \frac{1}{q_j(x_j)} \int_a^b q_j(x)\,dx = \frac{1}{[p_n'(x_j)]^2} \int_a^b \frac{p_n^2(x)}{(x - x_j)^2}\,dx > 0.$ ■

Note that the only property of Gaussian quadrature used in this proof
is the fact that the formula with n nodes has degree of precision at least
2n - 2. Thus we may also conclude that any quadrature formula of the
form in (1) using n nodes and having degree of precision 2n - 2 has
positive coefficients. Another formula for computing the coefficients $\alpha_j$ is
derived in the next section [see equations (4.5)].
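Theorems 1 and 2 can be checked numerically on [-1, 1], where the relevant orthogonal polynomials are the Legendre polynomials. The sketch below rests on assumptions not drawn from the text: the nodes are found by Newton's method from the customary cosine starting guesses, and the coefficients are evaluated with the standard closed form $\alpha_j = 2/\{(1 - x_j^2)[P_n'(x_j)]^2\}$ for Legendre weights rather than with (5). The function names are hypothetical.

```python
import math


def legendre(n, x):
    """Return (P_n(x), P_n'(x)) for the Legendre polynomial, valid for |x| < 1."""
    if n == 0:
        return 1.0, 0.0
    p_prev, p = 1.0, x
    for k in range(2, n + 1):
        # Three-term recurrence: k P_k = (2k - 1) x P_{k-1} - (k - 1) P_{k-2}.
        p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
    dp = n * (x * p - p_prev) / (x * x - 1.0)
    return p, dp


def gauss_legendre(n, tol=1e-15, max_iter=100):
    """Nodes and coefficients of the n-point Gaussian formula on [-1, 1]."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))  # starting guess
        for _ in range(max_iter):
            p, dp = legendre(n, x)
            step = p / dp
            x -= step
            if abs(step) < tol:
                break
        _, dp = legendre(n, x)
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


def integrate(f, n):
    """Apply the n-point formula to f on [-1, 1]."""
    nodes, weights = gauss_legendre(n)
    return sum(w * f(x) for x, w in zip(nodes, weights))
```

Applied to the monomials $x^k$ with n = 4, the formula reproduces the exact integrals for k up to 7 and first fails at k = 8, which is the degree of precision 2n - 1 asserted in Theorem 1; every computed coefficient is positive, as Theorem 2 requires.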

We can obtain expressions for the error in Gaussian quadrature which
are more useful than that given in (2). The first such result can be stated as

THEOREM 3. Let f'(x) be continuous in the closed interval [a, b]. Let
$\xi_1, \ldots, \xi_n$ be any n distinct points in [a, b] which do not coincide with
the zeros, $x_1, x_2, \ldots, x_n$, of the nth orthogonal polynomial, $p_n(x)$, over
[a, b]. Then the error in n-point Gaussian quadrature applied to $\int_a^b f(x)\,dx$
is

(6) $E_n\{f\} = \int_a^b p_n(x)(x - \xi_1) \cdots (x - \xi_n)\,f[x_1, \ldots, x_n, \xi_1, \ldots, \xi_n, x]\,dx.$

Proof. Using the 2n distinct points $x_j$ and $\xi_j$, the function f(x) can be
written as

$f(x) = P_{2n-1}(x) + R_{2n-1}(x),$

where $P_{2n-1}(x)$ is the interpolation polynomial of degree at most 2n - 1
agreeing with f(x) at the 2n points $x_j$ and $\xi_j$, and $R_{2n-1}(x)$ is the interpolation
error. With Newton's form for the remainder we write this error
as

$R_{2n-1}(x) = \prod_{j=1}^{n} [(x - x_j)(x - \xi_j)]\,f[x_1, \ldots, x_n, \xi_1, \ldots, \xi_n, x].$

These expressions in (1) yield

(7) $\int_a^b P_{2n-1}(x)\,dx + \int_a^b R_{2n-1}(x)\,dx = \sum_{j=1}^{n} \alpha_j P_{2n-1}(x_j) + \sum_{j=1}^{n} \alpha_j R_{2n-1}(x_j) + E_n\{f\}.$

However, since the degree of precision of the quadrature formula is
2n - 1 we must have

$\int_a^b P_{2n-1}(x)\,dx = \sum_{j=1}^{n} \alpha_j P_{2n-1}(x_j).$

Also, since f'(x) is continuous, it follows that $f[x_1, \ldots, x_n, \xi_1, \ldots, \xi_n, x_j]$
has a finite value for $j = 1, 2, \ldots, n$ and so $R_{2n-1}(x_j) = 0$. By using these
results in (7), we obtain (6). ■

We note that there is great freedom in the choice of the n points $\xi_j$ in
Theorem 3. Further, the conditions on f(x) could be relaxed somewhat
to require only continuity of f(x) on [a, b] and differentiability at the points
$x_j$ and $\xi_j$ and the error representation (6) remains valid (see Problems
1.4 through 1.6 of Chapter 6). By requiring more differentiability of f(x),
the result in Theorem 3 can be simplified. The most common such
simplification is stated as the

COROLLARY. Let f(x) have a continuous derivative of order 2n in [a, b].
Then the error in n-point Gaussian quadrature applied to $\int_a^b f(x)\,dx$ is

(8) $E_n\{f\} = \frac{f^{(2n)}(\xi)}{(2n)!} \int_a^b p_n^2(x)\,dx,$

where $\xi$ is some point in (a, b).

Proof. Under the assumed continuity conditions on f(x) the integrand
in (6) is a continuous function of the n points $\xi_1, \ldots, \xi_n$. Thus it is
legitimate in this integral to let $\xi_j \to x_j$ for $j = 1, 2, \ldots, n$, and obtain,
by applying the mean value theorem for integrals,

$E_n\{f\} = \int_a^b p_n^2(x)\,f[x_1, \ldots, x_n, x_1, \ldots, x_n, x]\,dx$

$= f[x_1, \ldots, x_n, x_1, \ldots, x_n, \eta] \int_a^b p_n^2(x)\,dx, \qquad a < \eta < b.$

The result (8) now follows from the extension of Corollary 2 of Theorem
1.2 in Chapter 6. ■
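As a small worked check of the error formula (8), consider n = 2 on [-1, 1]. The values used below (nodes $\pm 1/\sqrt{3}$, both coefficients equal to 1) are the standard two-point Gaussian data and are assumed here rather than derived.

```latex
% Two-point Gaussian quadrature on [-1, 1]: p_2(x) = x^2 - 1/3.
\int_{-1}^{1} p_2^2(x)\,dx
  = \int_{-1}^{1} \left(x^4 - \tfrac{2}{3}x^2 + \tfrac{1}{9}\right) dx
  = \tfrac{2}{5} - \tfrac{4}{9} + \tfrac{2}{9} = \tfrac{8}{45},
\qquad
E_2\{f\} = \frac{f^{(4)}(\xi)}{4!}\cdot\frac{8}{45} = \frac{f^{(4)}(\xi)}{135}.

% Direct check with f(x) = x^4, for which f^{(4)} = 24:
\int_{-1}^{1} x^4\,dx - \left[\left(\tfrac{1}{\sqrt{3}}\right)^4 + \left(-\tfrac{1}{\sqrt{3}}\right)^4\right]
  = \tfrac{2}{5} - \tfrac{2}{9} = \tfrac{8}{45} = \frac{24}{135}.
```

The two values agree, as they must, since the fourth derivative of $x^4$ is constant and the choice of $\xi$ is then immaterial.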
It should be recalled in all of these results that $p_n(x)$ is not the normalized
orthogonal polynomial of degree n over [a, b], but, by (3), is the one with
leading coefficient unity. So if the nth degree orthonormal polynomial is

$Q_n(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_0,$

then, since $p_n(x)$ and $Q_n(x)$ have the same zeros,

$p_n(x) = \frac{1}{a_n} Q_n(x).$

Thus we deduce that

(9) $\int_a^b p_n^2(x)\,dx = \frac{1}{a_n^2} \int_a^b Q_n^2(x)\,dx = \frac{1}{a_n^2}.$

For example if a = -1 and b = 1 then the Legendre polynomials, $P_n(x)$,
are the relevant orthogonal polynomials. It can be shown that

$P_n(x) = \frac{1}{2^n n!} \frac{d^n}{dx^n} (x^2 - 1)^n$

and they are normalized by forming $\sqrt{\frac{2n + 1}{2}}\,P_n(x)$. Thus we find in this
case that

$a_n = \frac{(2n)!}{2^n (n!)^2} \sqrt{\frac{2n + 1}{2}}$

and the error expression (8) becomes for Gaussian quadrature over
[-1, 1]:

(10) $E_n\{f\} = \frac{2}{(2n + 1)!} \left[\frac{2^n (n!)^2}{(2n)!}\right]^2 f^{(2n)}(\xi), \qquad -1 < \xi < 1.$

4. WEIGHTED QUADRATURE FORMULAE 

It is of practical and theoretical interest to consider the approximate
evaluation, for a fixed weight function w(x), of integrals of the form

(1) $W\{g\} = \int_a^b g(x)\,w(x)\,dx,$

by quadrature formulae of the form

(2) $W_n\{g\} = \sum_{j=1}^{n} \beta_j g(x_j).$

We again call the points $x_j$ the nodes and the $\beta_j$ the coefficients of the formula.
However, only the factor g(x) in the integrand enters directly into the
evaluation of the integration formula (2). The weight factor, w(x), enters
into the determination of the coefficients and nodes. Once these are determined
the formula may be applied to integrals of the form (1) with
different functions g(x) but the same weight function. Formulae of the
form (2), when applied to approximate integrals of the form (1), are called
weighted quadrature formulae. To evaluate integrals of the form (0.1)
by such formulae, write

$\int_a^b f(x)\,dx = \int_a^b \frac{f(x)}{w(x)}\,w(x)\,dx,$

and use (2) with g(x) = f(x)/w(x). As we shall see, there are frequently
advantages to such procedures.

Many of the previous results are valid, with but slight changes, for the
weighted quadrature formulae. Their degree of precision is defined as
before; i.e., (2) has degree of precision m if

$W_n\{x^k\} = W\{x^k\}, \qquad \text{for } k = 0, 1, \ldots, m,$

but not for k = m + 1.

Given n + 1 distinct points $x_0, x_1, \ldots, x_n$ the weighted interpolatory
formula with these points as nodes and weight function w(x) over [a, b] is,
say,

(3a) $W_{n+1}\{g\} = \sum_{j=0}^{n} w_{n,j}\,g(x_j);$

where

(3b) $w_{n,j} = \int_a^b \phi_{n,j}(x)\,w(x)\,dx, \qquad j = 0, 1, \ldots, n.$

Here the $\phi_{n,j}(x)$ are the Lagrange interpolation coefficients for the points
$x_0, x_1, \ldots, x_n$. Since $g(x) = P_n(x) + \omega_n(x)\,g[x_0, \ldots, x_n, x]$ where $P_n(x)$ is
the Lagrange interpolation polynomial of degree n, the error in (3a and b)
is

(3c) $E_{n+1}\{g\} = W\{g\} - W_{n+1}\{g\}$

$= \int_a^b [g(x) - P_n(x)]\,w(x)\,dx$

$= \int_a^b \omega_n(x)\,g[x_0, \ldots, x_n, x]\,w(x)\,dx.$
By assuming sufficient differentiability of g(x) we can simplify this error
expression. Also if $\omega_n(x)w(x)$ does not change sign in [a, b], even further
simplification is possible. The case of equally spaced nodes does not yield
particularly simple error expressions for arbitrary weight functions, w(x),
and so we do not study these formulae further. It should be observed,
however, that a modification of the method of undetermined coefficients
still applies (i.e., the right-hand sides in (1.16) are changed from $\int_a^b x^\nu\,dx$
to $\int_a^b x^\nu w(x)\,dx$). Hence, the result of Theorem 1.3 is valid when modified
to refer to weighted formulae.

If the weight function, $w(x)$, is positive in $[a, b]$ and, say for simplicity,
continuous, then the Gaussian quadrature formulae also generalize in an
obvious manner. These generalizations are best derived by seeking the $n$
nodes, $x_j$, and coefficients, $\beta_j$, such that the weighted formula (2) will
have the maximum degree of precision. We find now with $q_n(x) \equiv
(x - x_1) \cdots (x - x_n)$ that

THEOREM 1. The weighted quadrature formula (2) has degree of precision
at most $2n - 1$. This maximum degree of precision is attained iff the $n$
nodes, $x_j$, are the zeros of $q_n(x)$, the $n$th orthogonal polynomial with respect
to the weight $w(x)$ over $[a, b]$. The formula is a weighted interpolatory one.

Proof. The details of the proof are left as an exercise to the reader.
They follow closely the proof of Theorem 3.1. ■


The coefficients of the weighted formula of maximum precision are given
by

(4)    $\beta_j = \frac{1}{q_n'(x_j)} \int_a^b \frac{q_n(x)}{x - x_j}\, w(x)\, dx, \qquad j = 1, 2, \ldots, n.$

Exactly as in Theorem 3.2 it follows that $\beta_j > 0$. The coefficients, $\beta_j$, are
called the Christoffel numbers. The formulae (2) are of the type frequently
called weighted Gaussian quadrature with special names applied for special
weight functions (see Subsection 4.1).

The coefficients $\beta_j$ of the weighted Gaussian formulae can be expressed
in a simpler form than that given by (4). For this purpose let $P_n(x)$ denote
the $n$th orthonormal polynomial over $[a, b]$ with respect to the given weight
function, $w(x)$. If the leading coefficient of $P_n(x)$ is $a_n$, we have $P_n(x) =
a_n q_n(x)$ and hence from (4)

$\beta_j = \frac{1}{P_n'(x_j)} \int_a^b \frac{P_n(x)}{x - x_j}\, w(x)\, dx.$

Now set $\xi = x_k$ in the Christoffel-Darboux relation (3.25) of Chapter 5,
multiply the result by $w(x)/(x - x_k)$ and integrate over $a \le x \le b$.
This yields, since $P_n(x_k) = 0$ and

$P_0(x) = \left[\int_a^b w(x)\, dx\right]^{-1/2},$

$1 = -\frac{a_n}{a_{n+1}} \int_a^b \frac{P_n(x)}{x - x_k}\, w(x)\, dx \; P_{n+1}(x_k), \qquad k = 1, 2, \ldots, n.$

Use this in the previous expression for $\beta_j$ to obtain

(5a)    $\beta_j = -\frac{a_{n+1}}{a_n}\, \frac{1}{P_n'(x_j)\, P_{n+1}(x_j)}.$

From the three term recursion of Theorem 3.5 in Chapter 5 we find that,
since $x_j$ is a zero of $P_n(x)$,

$P_{n+1}(x_j) = -\frac{a_{n+1} a_{n-1}}{a_n^2}\, P_{n-1}(x_j),$

and (5a) becomes

(5b)    $\beta_j = \frac{a_n}{a_{n-1}}\, \frac{1}{P_n'(x_j)\, P_{n-1}(x_j)}.$

It should be observed that the coefficients, $\beta_j$, of the ordinary Gaussian
quadrature formulae are also given by the above formulae, (5), in which
the $P_n(x)$ are orthonormal with respect to the uniform weight, $w(x) \equiv 1$.

The errors of the weighted Gaussian formulae are derived exactly as in
Theorem 3.3 and its corollary. Thus under appropriate continuity conditions
on $g(x)$ we have

(6)    $E_n\{g\} = W\{g\} - W_n\{g\}$

       $= \int_a^b q_n(x)(x - \xi_1) \cdots (x - \xi_n)\, g[x_1, \ldots, x_n, \xi_1, \ldots, \xi_n, x]\, w(x)\, dx$

       $= \int_a^b q_n^2(x)\, g[x_1, \ldots, x_n, x_1, \ldots, x_n, x]\, w(x)\, dx$

       $= g[x_1, \ldots, x_n, x_1, \ldots, x_n, \eta] \int_a^b q_n^2(x)\, w(x)\, dx$

       $= \frac{g^{(2n)}(\xi)}{(2n)!} \int_a^b q_n^2(x)\, w(x)\, dx, \qquad a < \xi < b.$


4.1. Gauss-Chebyshev Quadrature 

The polynomials orthogonal over $[-1, 1]$ with respect to the weight
$w(x) = (1 - x)^{-p}(1 + x)^{-q}$, provided $p < 1$ and $q < 1$, are known as the
Jacobi polynomials. The special case, $p = q = \tfrac{1}{2}$, arises in the treatment
of integrals of the form

(7)    $\int_{-1}^{1} \frac{g(x)}{\sqrt{1 - x^2}}\, dx.$

That is, consider the orthonormal polynomials over $[-1, 1]$ with respect
to the weight function $(1 - x^2)^{-1/2}$, say $Q_n(x)$, such that

$\int_{-1}^{1} Q_n(x)\, Q_m(x)\, \frac{dx}{\sqrt{1 - x^2}} = \delta_{n,m}.$

Introduce the change of variable

$x = \cos\theta, \qquad 0 \le \theta \le \pi,$

and these integrals reduce to the form

$\int_0^{\pi} Q_n(\cos\theta)\, Q_m(\cos\theta)\, d\theta = \delta_{n,m}.$

In Problem 3.9 and equation (4.13) of Chapter 5 we verify that the polynomials
are

(8)    $Q_n(x) = \sqrt{\frac{2}{\pi}}\, \cos(n \cos^{-1} x), \quad n = 1, 2, \ldots; \qquad Q_0(x) = \frac{1}{\sqrt{\pi}}.$

The nodes for the $n$-point quadrature formula of maximum degree of
precision are, by Theorem 1, the points $x_j$ such that

$Q_n(x_j) = 0, \qquad -1 < x_j < 1.$

But from (8) the zeros are

(9)    $x_j = \cos\theta_j = \cos\frac{(2j - 1)\pi}{2n}, \qquad j = 1, 2, \ldots, n.$

The Christoffel numbers, $\beta_j$, for this best formula are most easily
evaluated by using (5). That is, from (4.13) of Chapter 5 and (8)

$Q_n(x) = 2^{n-1}\sqrt{\frac{2}{\pi}}\, x^n + \cdots, \qquad \text{for } n = 1, 2, \ldots.$

Hence

$\frac{a_n}{a_{n-1}} = 2.$

But by using (5b)

$\beta_j = \frac{2}{Q_n'(x_j)\, Q_{n-1}(x_j)}, \qquad \text{for } n = 2, 3, \ldots;$

therefore, from (9) with

$\frac{d}{dx} = \frac{d\theta}{dx}\, \frac{d}{d\theta} = -\frac{1}{\sin\theta}\, \frac{d}{d\theta},$

$\beta_j = \frac{\pi}{n}\, \frac{\sin\theta_j}{\sin n\theta_j\, \cos(n - 1)\theta_j}.$

Now $\cos(n - 1)\theta_j = \sin\theta_j \sin n\theta_j$, whence

(10)    $\beta_j = \frac{\pi}{n}, \qquad j = 1, 2, \ldots, n.$

The quadrature formulae thus derived are

(11)    $\int_{-1}^{1} \frac{g(x)}{\sqrt{1 - x^2}}\, dx \cong \frac{\pi}{n} \sum_{j=1}^{n} g\!\left(\cos\frac{(2j - 1)\pi}{2n}\right), \qquad n = 1, 2, \ldots.$

It is of great interest to note that each such formula has uniform coefficients
and that the $n$-point Gauss-Chebyshev formula (11) has degree of precision
$2n - 1$. Thus the problem posed in Section 2, of choosing coefficients
to minimize the mean-square roundoff error in evaluating the sum (2),
is solved by the same coefficients that yield maximum precision in
approximating integrals of the form (7).
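
As an illustration of the Gauss-Chebyshev formula of this subsection, the following minimal Python sketch sums the integrand at the Chebyshev nodes $\cos((2j - 1)\pi/(2n))$ with the uniform weight $\pi/n$. The function name `gauss_chebyshev` and the test integrand are choices made for this example only; they do not come from the text.

```python
import math

def gauss_chebyshev(g, n):
    """n-point Gauss-Chebyshev estimate of the integral of
    g(x) / sqrt(1 - x^2) over [-1, 1]."""
    total = 0.0
    for j in range(1, n + 1):
        # Node: zero of the n-th Chebyshev polynomial.
        x_j = math.cos((2 * j - 1) * math.pi / (2 * n))
        total += g(x_j)
    # Every node carries the same Christoffel number pi / n.
    return math.pi / n * total

if __name__ == "__main__":
    # x^2 has degree 2 <= 2n - 1 when n = 2, so the rule should
    # reproduce the exact value pi / 2 up to roundoff.
    print(gauss_chebyshev(lambda x: x * x, 2), math.pi / 2)
```

Because the degree of precision is $2n - 1$, the two-point rule already integrates $x^2/\sqrt{1 - x^2}$ exactly, while a polynomial of degree $2n$ is in general not reproduced.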


PROBLEM, SECTION 4 

1. Carry out the proof of Theorem 1. 


5. COMPOSITE QUADRATURE FORMULAE 

By Theorem 1.3 (and its generalization for weighted quadrature) we 
see that all of the formulae considered thus far have been interpolatory. 
Thus, in effect, the integrand has been approximated by a single polynomial 
over the entire interval of integration and the integral of this polynomial 
is the approximation to the integral. (This is the justification of the name 
simple quadrature formula.) In order to get reasonable accuracy over a 
large integration interval, low degree polynomial approximations would in 
general not suffice. We learned in Subsection 3.4 of Chapter 6 that a high 
order interpolation polynomial may be a poor approximation to a smooth 
function in the case that the nodes are uniformly spaced. Hence we avoid
using integration formulae based on interpolation polynomials of high
degree and uniformly spaced nodes. On the other hand, the coefficients
and nodes for formulae with maximum degree of precision are not available
for large orders and are difficult to compute with great accuracy.
Nevertheless, it is possible to devise quadrature formulae which have
simple coefficients and nodes, and yet yield accurate approximations.
These are the so-called composite rules which, in brief, are devised by
dividing the integration interval into subintervals (usually of equal length)
and then applying some formula of relatively low degree of precision over
each of the subintervals. There are many composite quadrature formulae,
and we only examine those most commonly used. 

Let the integral to be approximated be

(1)    $\int_a^b f(x)\, dx.$

Given integers $m$ and $n$, define

(2a)    $H = \frac{b - a}{m}, \qquad h = \frac{H}{n},$

and divide $[a, b]$ into $m$ subintervals, each of length $H$, by the points

(2b)    $y_j = a + jH, \qquad j = 0, 1, \ldots, m.$

Each of these subintervals is divided into finer subintervals of length $h$
by the points

(2c)    $x_k = a + kh, \qquad k = 0, 1, \ldots, mn.$

Now (1) may be written as

(3)    $\int_a^b f(x)\, dx = \sum_{j=1}^{m} \int_{y_{j-1}}^{y_j} f(x)\, dx.$

By using the appropriate points $x_k$ of (2c) each of the $m$ integrals on the
right-hand side of (3) can be approximated by a closed Newton-Cotes
formula with $n + 1$ nodes. That is, by adapting the notation of Section 1,

(4)    $\int_{y_{j-1}}^{y_j} f(x)\, dx = \sum_{k=0}^{n} w_{n,k}^{(j)}\, f(y_{j-1} + kh) + E_{n+1}^{(j)}\{f\}, \qquad j = 1, 2, \ldots, m;$

where $E_{n+1}^{(j)}\{f\}$ is the error in the $(n + 1)$-point formula applied to the
$j$th integral and $w_{n,k}^{(j)}$ is the $k$th coefficient for the $j$th integral.

The coefficients, $w_{n,k}^{(j)}$, are independent of $j$. In fact, from (1.15), we may
write

(5)    $w_{n,k}^{(j)} = h A_{n,k}(0, n) \equiv h A_{n,k},$

where $A_{n,k}$ depends only upon the integers $n$ and $k$. That is, corresponding
coefficients in each subinterval are equal. With the use of (4) and (5),
equation (3) becomes

(6)    $\int_a^b f(x)\, dx = h \sum_{j=1}^{m} \sum_{k=0}^{n} A_{n,k}\, f(y_{j-1} + kh) + \sum_{j=1}^{m} E_{n+1}^{(j)}\{f\}$

       $= h \sum_{k=0}^{n} A_{n,k} \left[\sum_{j=1}^{m} f(y_{j-1} + kh)\right] + \sum_{j=1}^{m} E_{n+1}^{(j)}\{f\}.$


However, since $y_j - y_{j-1} = nh$ it is seen that the values of the integrand
at the points $y_j$ with $j = 1, 2, \ldots, m - 1$ appear twice in (6). We account
for this repetition and rewrite the sum in the form

(7)    $\int_a^b f(x)\, dx = h\Big\{ A_{n,0} f(y_0) + A_{n,n} f(y_m) + (A_{n,0} + A_{n,n}) \sum_{j=1}^{m-1} f(y_j)$

       $\qquad + \sum_{k=1}^{n-1} A_{n,k} \Big[\sum_{j=1}^{m} f(y_{j-1} + kh)\Big] \Big\} + E_{m,n+1}\{f\}.$

Here we have introduced the composite error

(8)    $E_{m,n+1}\{f\} \equiv \sum_{j=1}^{m} E_{n+1}^{(j)}\{f\}.$

Since the same closed Newton-Cotes formula has been employed over
each interval $[y_{j-1}, y_j]$, we deduce from (1.11) and Lemma 1.1 applied
to (8), that

(9a)    $E_{m,n+1}\{f\} = \frac{m M_n}{(n + 2)!}\, h^{n+3} f^{(n+2)}(\xi), \qquad M_n = \int_0^n t\, \pi_n(t)\, dt < 0, \quad n \text{ even};$

        $E_{m,n+1}\{f\} = \frac{m M_n}{(n + 1)!}\, h^{n+2} f^{(n+1)}(\xi), \qquad M_n = \int_0^n \pi_n(t)\, dt < 0, \quad n \text{ odd}.$

Here $a < \xi < b$ and we have assumed that the indicated derivative of
$f(x)$ is continuous on $[a, b]$. By (2a), $m = (b - a)/H$ and $h = H/n$, so this error can be written as

(9b)    $E_{m,n+1}\{f\} = \frac{(b - a) M_n}{(n + 2)!\, n^{n+3}}\, H^{n+2} f^{(n+2)}(\xi), \quad n \text{ even}; \qquad E_{m,n+1}\{f\} = \frac{(b - a) M_n}{(n + 1)!\, n^{n+2}}\, H^{n+1} f^{(n+1)}(\xi), \quad n \text{ odd}.$

Thus for fixed $n$, the error can be made arbitrarily small by letting $H \to 0$.
In this manner, we find that composite quadrature formulae may be very
accurate when applied to integrals whose integrands do not possess high
order derivatives.

The most common composite formulae are those with $n = 1$ (trapezoidal
rule) and $n = 2$ (Simpson’s rule). For the trapezoidal rule we have

$h = H = \frac{b - a}{m}$

and (7) and (9) yield

(10)    $\int_a^b f(x)\, dx = \frac{h}{2}\Big\{ f(a) + f(b) + 2 \sum_{j=1}^{m-1} f(a + jh) \Big\} - \frac{b - a}{12}\, h^2 f^{(2)}(\xi).$

For Simpson’s rule, with $n = 2$ in (5) and (9),

$h = \frac{H}{2} = \frac{b - a}{2m}, \qquad A_{2,0} = A_{2,2} = \tfrac{1}{3}, \qquad A_{2,1} = \tfrac{4}{3}, \qquad M_2 = -\tfrac{4}{15};$

so that (7) becomes

(11)    $\int_a^b f(x)\, dx = \frac{h}{3}\Big\{ f(a) + f(b) + 2 \sum_{j=1}^{m-1} f(a + 2jh) + 4 \sum_{j=1}^{m} f(a + [2j - 1]h) \Big\} - \frac{b - a}{180}\, h^4 f^{(4)}(\xi).$

We note that in formulae (10) and (11) the coefficients are all positive
and so the roundoff errors are not generally magnified. In fact, the
Newton-Cotes closed formulae have positive coefficients for $n < 8$.

In practice, the nodal tabulation of $f(x)$ in $[a, b]$ may not permit the
use of the composite Simpson’s rule because the number of net points,
$N + 1$, is even. That is, the uniformly spaced points are $x_j = x_0 + jh$,
such that $x_0 = a$, $x_N = b$.

In this case we could use the closed formula

$\int_a^{a+3h} f(x)\, dx = \frac{3h}{8}\, [f(a) + 3f(a + h) + 3f(a + 2h) + f(a + 3h)] + E_4,$

with

$E_4 = -\frac{3h^5}{80}\, f^{(4)}(\xi), \qquad a < \xi < a + 3h.$

The remaining integral

$\int_{a+3h}^{b} f(x)\, dx$

can then be evaluated by the composite Simpson’s rule.

The error term $E_4$ is comparable to the error term

$E_3 = -(h^5/90)\, f^{(4)}(\eta).$

This illustrates the general principle of forming composite rules with
simple formulae of comparable accuracy.

5.1. Periodic Functions and the Trapezoidal Rule 

Experience has shown that if $f(x)$ is periodic, i.e., $f(x + L) = f(x)$,
the formula for the integral over a period,

(12)    $\int_0^L f(x)\, dx \cong h \sum_{j=1}^{N} f(x_j), \qquad x_j = jh, \quad h = \frac{L}{N},$

is remarkably accurate. One possible explanation is that (12) arises from
the composition of formulae having an arbitrarily high degree of precision.
That is, from the Euler-Maclaurin summation formula (4.22) of Chapter 6
(also see Problem 4.7 of Chapter 6) for $p = 1$,

$\int_{x_0}^{x_1} f(x)\, dx = h f(x_0) + \frac{h}{2}\, [f(x_1) - f(x_0)]$

$\qquad - \frac{h^2}{12}\, [f'(x_1) - f'(x_0)]$

$\qquad + \frac{h^4}{720}\, [f^{(3)}(x_1) - f^{(3)}(x_0)] + \cdots.$

If the above is composed for all of the intervals $(x_j, x_{j+1})$ where
$0 \le j \le N - 1$, we find that the terms in brackets cancel in the interior
for any function $f(x)$, but also cancel at $x_0$ and $x_N$ since $f(x)$ is periodic.
We have, in fact,

THEOREM 1. If $f(x) \in C^{2m+2}[0, L]$ and $f(x)$ is $L$-periodic, then the composite
trapezoidal rule, (12), has the error

$\epsilon_N \equiv \int_0^L f(x)\, dx - h\, [\tfrac{1}{2} f(x_0) + f(x_1) + \cdots + f(x_{N-1}) + \tfrac{1}{2} f(x_N)]$

where

$\epsilon_N = h^{2m+2} L\, C_m f^{(2m+2)}(\xi),$

with some $\xi$ such that $0 < \xi < L$, and with a constant $C_m$ that is
independent of $N$ and $f(x)$.

Proof. Note the central difference integration formula that is derived
in Problem 1.5,

$\int_{x_j}^{x_{j+1}} f(x)\, dx = \frac{h}{2}\, (f_j + f_{j+1}) + h r_1 (\delta^2 f_j + \delta^2 f_{j+1}) + \cdots$

$\qquad + h r_m (\delta^{2m} f_j + \delta^{2m} f_{j+1}) + C_m h^{2m+3} f^{(2m+2)}(\xi_j).$

Add the formulae for each interval $(x_j, x_{j+1})$, $0 \le j \le N - 1$. The
difference terms contribute nothing to this sum. That is, with the notation

$\psi_j \equiv \tfrac{1}{2} f_j + f_{j+1} + \cdots + f_{j+N-1} + \tfrac{1}{2} f_{j+N},$

the integral becomes

$\int_{x_0}^{x_N} f(x)\, dx = h\, [\psi_0 + 2 r_1 \delta^2 \psi_0 + 2 r_2 \delta^4 \psi_0 + \cdots + 2 r_m \delta^{2m} \psi_0]$

$\qquad + h^{2m+3} C_m \sum_{j=0}^{N-1} f^{(2m+2)}(\xi_j).$

But it is easy to see from the periodicity of $f(x)$ that

$\psi_p = \sum_{s=0}^{N-1} f_s = \psi_0$

for all integers $p$. Hence, in particular,

$\delta^{2k} \psi_0 = 0$ for all integers $k \ge 1$.

Therefore, by Lemma 1.1 and the definition of $h$ in (12),

(13)    $\int_{x_0}^{x_N} f(x)\, dx = h\, [\tfrac{1}{2} f_0 + f_1 + f_2 + \cdots + f_{N-1} + \tfrac{1}{2} f_N] + h^{2m+3} C_m \sum_{j=0}^{N-1} f^{(2m+2)}(\xi_j)$

        $= h\, [\tfrac{1}{2} f_0 + f_1 + f_2 + \cdots + f_{N-1} + \tfrac{1}{2} f_N] + h^{2m+2} L\, C_m f^{(2m+2)}(\xi), \qquad 0 < \xi < L.$ ■
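
The rapid convergence asserted for periodic integrands is easy to observe numerically. The sketch below is an illustration only: the test integrand $e^{\cos x}$ on $[0, 2\pi]$ is an assumption of this example, and its exact integral, $2\pi I_0(1)$ with $I_0$ the modified Bessel function, is a standard result that is not taken from the text. The helper names are hypothetical.

```python
import math

def periodic_trapezoid(f, L, N):
    """Rule (12): h * sum of f(x_j) with x_j = j*h, h = L/N, for L-periodic f."""
    h = L / N
    return h * sum(f(j * h) for j in range(1, N + 1))

def bessel_i0(x, terms=30):
    """Power series of the modified Bessel function I_0 (reference value only)."""
    return sum((x * x / 4) ** k / math.factorial(k) ** 2 for k in range(terms))

if __name__ == "__main__":
    exact = 2 * math.pi * bessel_i0(1.0)
    for N in (2, 4, 8, 16):
        approx = periodic_trapezoid(lambda x: math.exp(math.cos(x)), 2 * math.pi, N)
        print(N, abs(approx - exact))
```

For this smooth periodic integrand the error falls far faster than the $h^2$ rate of the general trapezoidal rule, consistent with the theorem holding for every $m$.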


5.2. Convergence for Continuous Functions 

In the event that the function f(x) is merely continuous (or piecewise 
continuous, with jump discontinuities), we can still prove convergence of 
composite quadrature formulae that have non-negative coefficients and 
degree of precision s > 0, as in 

THEOREM 2. With the notation of 2 (a, b, and c), let

(14)    $S_{m,n}\{f\} \equiv \sum_{j=1}^{m} S_{m,n}^{(j)}\{f\},$

where

(15)    $S_{m,n}^{(j)}\{f\} \equiv \sum_{k=0}^{n} \alpha_k^{(j)}\, f(y_{j-1} + kh).$

If $f(x)$ is continuous in $[a, b]$, $\alpha_k^{(j)} \ge 0$ and if $S_{m,n}^{(j)}$ has degree of precision $s$
(i.e.,

$S_{m,n}^{(j)}\{g\} = \int_{y_{j-1}}^{y_j} g(\xi)\, d\xi$

for $g(\xi) \equiv \xi^p$, $0 \le p \le s$), then

$\lim_{m \to \infty} S_{m,n}\{f\} = \int_a^b f(x)\, dx.$

Proof. As $m$ tends to infinity the closed intervals $[y_{j-1}, y_j]$ become
arbitrarily small. Hence given any $\epsilon > 0$ there is an $M$ such that if $m > M$
there exist polynomials $P^{(j)}(x)$ for $j = 1, 2, \ldots, m$, of degree at most $s$,
such that

$\max_{y_{j-1} \le x \le y_j} |f(x) - P^{(j)}(x)| \le \epsilon$ for $j = 1, 2, \ldots, m$.

Now

(16)    $\left| \int_{y_{j-1}}^{y_j} f(x)\, dx - S_{m,n}^{(j)}\{f\} \right| \le \left| \int_{y_{j-1}}^{y_j} f(x)\, dx - \int_{y_{j-1}}^{y_j} P^{(j)}(x)\, dx \right| + \left| \int_{y_{j-1}}^{y_j} P^{(j)}(x)\, dx - S_{m,n}^{(j)}\{f\} \right|$

        $\le \epsilon\, |y_j - y_{j-1}| + |S_{m,n}^{(j)}\{P^{(j)} - f\}|.$

The fact that $S_{m,n}^{(j)}$ has degree of precision $s$ was used to obtain the last
term on the right-hand side.

The fact that $S_{m,n}^{(j)}\{g\}$ is exact for $g(\xi)$ identically constant and that
$\alpha_k^{(j)} \ge 0$, implies that

$\sum_{k=0}^{n} |\alpha_k^{(j)}| = |y_j - y_{j-1}|.$

Hence (16) yields

$\left| \int_{y_{j-1}}^{y_j} f(x)\, dx - S_{m,n}^{(j)}\{f\} \right| \le 2\epsilon\, |y_j - y_{j-1}|.$

Therefore,

$\left| \int_a^b f(x)\, dx - S_{m,n}\{f\} \right| \le 2\epsilon\, |b - a|.$ ■

By picking $P^{(j)}(x) \equiv f(\xi_j)$, where $\xi_j = (y_{j-1} + y_j)/2$, we find the estimate

$\left| \int_a^b f(x)\, dx - S_{m,n}\{f\} \right| \le 2\, |b - a|\, \omega(f; \delta),$

where $\omega$ is the modulus of continuity and $\delta = \max_j |y_{j-1} - y_j|/2$. ■

We leave to Problems 8 and 9 the formulation of generalizations of 
Theorem 2. 


PROBLEMS, SECTION 5 


1. At what interval in $x$ and to how many decimal places must $f(x)$ be
tabulated in order to evaluate $\int_0 f(x)\, dx$ correctly to six decimal places by
using:

(a) composite trapezoidal rule,

(b) composite Simpson’s rule, for $f(x) = \cos x$?

2. The composite midpoint rule is based on the single node, open
Newton-Cotes formula with error, $E_1$:

$\int_a^{a+2h} f(x)\, dx = 2h f(a + h) + E_1.$

Find the expression for the composite formula and error term when the
integral to be evaluated is

$\int_a^{a+2mh} f(x)\, dx, \qquad m = \frac{b - a}{2h}.$

3. Answer Problem 1 for the composite midpoint rule (see Problem 2). 

4.* Use the notation of Problem 1.5 to derive the composite trapezoidal rule
with end corrections:

$\int_{x_0}^{x_N} f(x)\, dx = h \Big[ (\tfrac{1}{2} f_0 + f_1 + \cdots + f_{N-1} + \tfrac{1}{2} f_N) + \sum_{k=1}^{m} r_k (d_{N,k} - d_{0,k}) \Big] + C_m (x_N - x_0)\, h^{2m+2} f^{(2m+2)}(\xi),$

where

$d_{j,k} = \Delta^{2k-1} f_{j-k} + \Delta^{2k-1} f_{j-k+1},$

and

$x_{-k} < \xi < x_{N+k}.$

[Hint: Use equation (3.16a) of Chapter 6 to get

$\delta^{2k} f_s = \Delta^{2k-1} f_{s-k+1} - \Delta^{2k-1} f_{s-k},$

whence

$\sum_{s=1}^{N-1} \delta^{2k} f_s = \Delta^{2k-1} f_{N-k} - \Delta^{2k-1} f_{1-k}.$]

Since the coefficients of $\{f_j\}$ which occur in $\Delta^n f_0$ are the alternating binomial
coefficients $(-1)^{n-j} \binom{n}{j}$, we see that

$d_{j,k} = \sum_{p=-k}^{k} c_p^{(k)} f_{j+p},$

where

$c_0^{(k)} = 0, \qquad c_p^{(k)} = -c_{-p}^{(k)}, \qquad p = 1, 2, \ldots, k.$
5.* Given the integer $N > 0$, let $h = 1/N$, $x_0 = 0$, $x_j = x_0 + jh$;
$t_N \equiv h(\tfrac{1}{2} f_0 + f_1 + \cdots + f_{N-1} + \tfrac{1}{2} f_N)$.

(a) From Problem 4, verify that if $f(x) \in C^{2m+2}$,

$\int_0^1 f(x)\, dx - t_N = \sum_{j=1}^{m} a_j h^{2j} + O(h^{2m+2}).$

[Hint: Expand the values $f_{-p}$ and $f_{N+p}$, which appear in the end corrections,
by Taylor’s series about $x_0$ and $x_N$ respectively. Collect terms with like
powers of $h$.]

(b) Let

$s_N \equiv h\, [f_{1/2} + f_{3/2} + \cdots + f_{(2N-1)/2}].$

$s_N$ is the composite midpoint rule for evaluating

$\int_{x_0}^{x_N} f(x)\, dx$

(with intervals $h/2$; see Problem 2). Verify that

$t_{2N} = \tfrac{1}{2}(t_N + s_N).$

(c) Show that

$e_{2N} \equiv \int_0^1 f(x)\, dx - t_{2N} = \sum_{j=1}^{m} a_j \left(\frac{h}{2}\right)^{2j} + O(h^{2m+2}),$

and that

$\frac{4 e_{2N} - e_N}{3} = \int_0^1 f(x)\, dx - \frac{4 t_{2N} - t_N}{3} = \sum_{j=2}^{m} b_j h^{2j} + O(h^{2m+2}),$

and $b_r = 0$ if $a_r = 0$.

Call $t_{2N}^{(1)} \equiv (4 t_{2N} - t_N)/3$ the extrapolation of the trapezoidal rule.

6.* Romberg’s method: With the notation of Problem 5, consider the
sequence of subdivisions obtained by halving, i.e., $N = 1, 2, 4, \ldots, 2^k, \ldots$.

Define

$t_{2^k}^{(0)} \equiv t_{2^k}, \qquad k = 0, 1, 2, \ldots;$

$t_{2^k}^{(1)} \equiv \frac{4\, t_{2^k}^{(0)} - t_{2^{k-1}}^{(0)}}{4 - 1}, \qquad k = 1, 2, 3, \ldots;$

$t_{2^k}^{(2)} \equiv \frac{4^2\, t_{2^k}^{(1)} - t_{2^{k-1}}^{(1)}}{4^2 - 1}, \qquad k = 2, 3, 4, \ldots;$

$\cdots$

$t_{2^k}^{(p)} \equiv \frac{4^p\, t_{2^k}^{(p-1)} - t_{2^{k-1}}^{(p-1)}}{4^p - 1}, \qquad k = p, p + 1, \ldots.$

$t_{2^k}^{(p)}$ is called the $p$th extrapolation of the trapezoidal rule. Show, by induction
on $p$, that for fixed $p \ge 0$,

$\int_0^1 f(x)\, dx - t_{2^k}^{(p)} = O(2^{-k(2p+2)})$ if $p \le m$.

Romberg’s method consists in successively constructing the rows of the
triangular matrix $R \equiv (r_{k,p})$, with $r_{k,p} \equiv t_{2^k}^{(p)}$. If $a_p \ne 0$, $f(x) \in C^{2m+2}[-\delta,
1 + \delta]$ for some $\delta > 0$, and $0 < p \le m$, then, by (c), the entries in the $p$th
column converge more rapidly than those of the $(p - 1)$st column to $\int_0^1 f(x)\, dx$.

In Romberg’s method we may achieve the degree of precision of the end
correction formula and avoid the evaluation of $f(x)$ outside of the interval
$[x_0, x_N]$. In practice, the rows of $R$ may be successively computed until the
elements in some column have “converged” enough.

7.* Prove that the $p$th extrapolation of the trapezoidal rule is a quadrature
formula with non-negative weights and degree of precision at least $2p + 1$.
That is, to approximate $\int_0^1 f(x)\, dx$,

$t_{2^k}^{(p)} = 2^{-k} \sum_{j=0}^{2^k} c_j^{(p)} f(j\, 2^{-k}),$

$\sum_{j=0}^{2^k} c_j^{(p)} = 2^k, \qquad c_j^{(p)} \ge 0.$


8.* State and prove a generalization of Theorem 2, to cover the case where
$f(x)$ is piecewise continuous (with only a finite number of jump discontinuities)
and where the quadrature formula is not based on uniformly spaced nodes.

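9.* Under the conditions of Theorem 2, if $f(x)$ has a continuous derivative
of order $r$, show that

(a) If $r \le s$,

$\left| \int_a^b f(x)\, dx - S_{m,n}\{f\} \right| \le 2\, |b - a|\, \frac{\delta^r}{r!}\, \omega(f^{(r)}; \delta),$

where $\delta = \max_j |y_{j-1} - y_j|/2$.

[Hint: Pick

$P^{(j)}(x) \equiv f(\xi_j) + f'(\xi_j)(x - \xi_j) + \cdots + \frac{f^{(r)}(\xi_j)}{r!}\, (x - \xi_j)^r,$

with $\xi_j = (y_{j-1} + y_j)/2$. Verify that

$|f(x) - P^{(j)}(x)| \le \frac{\delta^r}{r!}\, \omega(f^{(r)}; \delta)$ for $y_{j-1} \le x \le y_j$.]

(b) If $r > s$,

$\left| \int_a^b f(x)\, dx - S_{m,n}\{f\} \right| \le 2\, |b - a|\, \frac{\delta^{s+1}}{(s + 1)!}\, K,$

where $K \equiv \max |f^{(s+1)}(x)|$.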
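
Romberg's method of Problem 6 can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `romberg`, the general interval `[a, b]` (the problem uses $[0, 1]$) and the list-of-rows layout are assumptions of this example. Each new trapezoidal value is obtained from the previous one and a midpoint sum, using the identity $t_{2N} = \tfrac{1}{2}(t_N + s_N)$ of Problem 5(b).

```python
import math

def romberg(f, a, b, kmax):
    """Return rows R[k][p], the p-th extrapolation of the trapezoidal
    rule on 2**k subintervals, for 0 <= p <= k <= kmax."""
    R = [[0.5 * (b - a) * (f(a) + f(b))]]   # t_1: a single trapezoid
    n_sub = 1
    for k in range(1, kmax + 1):
        h = (b - a) / n_sub
        # Midpoint sum s_N on the current mesh, then t_{2N} = (t_N + s_N) / 2.
        s = h * sum(f(a + (j + 0.5) * h) for j in range(n_sub))
        row = [0.5 * (R[k - 1][0] + s)]
        for p in range(1, k + 1):
            # p-th extrapolation from the two neighbouring (p-1)-th entries.
            row.append((4 ** p * row[p - 1] - R[k - 1][p - 1]) / (4 ** p - 1))
        R.append(row)
        n_sub *= 2
    return R

if __name__ == "__main__":
    table = romberg(math.exp, 0.0, 1.0, 5)
    exact = math.e - 1.0
    for k, row in enumerate(table):
        print(k, ["%.1e" % abs(v - exact) for v in row])
```

Printing the errors row by row shows each column converging faster than the one to its left, which is the behaviour the problem asks the reader to prove.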
6. SINGULAR INTEGRALS; DISCONTINUOUS INTEGRANDS


In deriving most of the quadrature formulae of this chapter and the 
appropriate error formulae, it was either stated or implied that the integrand
and various of its higher order derivatives were continuous. (An
exception is found in the weighted quadrature formulae where the weight 
need not be continuous.) If these conditions are violated, a quadrature 
method may still yield a good approximation but the error will, in general, 
be much larger than predicted. In less favorable circumstances, of course, 
the approximation will be meaningless. There are a number of cases of 
rather frequent occurrence in which such difficulties can be anticipated and 
satisfactorily resolved. 

6.1. Finite Jump Discontinuities 

If the integrand has a finite jump discontinuity at a known point (or
any finite number of them), say at $x = c$ in the interval of integration,
then we write

(1)    $\int_a^b f(x)\, dx = \int_a^c f(x)\, dx + \int_c^b f(x)\, dx.$

Now if $f(x)$ has sufficiently many continuous derivatives in $[a, c]$ and
$[c, b]$ the two integrals on the right-hand side may be accurately
approximated by any of a variety of quadrature formulae. This simple procedure
can be considered as an application of a special composite rule, not
necessarily one with equal spacing.

If the integrand is continuous but has a discontinuity in some low order
derivative, a similar procedure can be employed. For example, if $f(x) = |x|$
then $f'(x)$ has a finite jump at $x = 0$. In this case, a composite rule with
$x = 0$ as an endpoint of a subinterval could be used.

6.2. Infinite Integrand 

We consider the case in which $f(x)$ becomes infinite as $x \to a$, the lower
limit of integration. The upper limit can be similarly treated and an interior
discontinuity, say at $x = c$, is reduced to the endpoint cases by using (1).
We assume for the present that the integral is of the form

(2)    $I = \int_a^b \frac{g(x)}{(x - a)^\theta}\, dx, \qquad 0 < \theta < 1,$

where $g(x)$ has continuous derivatives in $[a, b]$ of “sufficiently high”
order. The restriction on $\theta$ insures that the integral in (2) exists for rather
general functions $g(x)$ [i.e., it is not required that $g(a) = 0$].

Method I. For any positive number $\epsilon$ in $0 < \epsilon \le b - a$ we write (2) as
$I = I_1 + I_2$ where

(3)    $I_1 = \int_a^{a+\epsilon} \frac{g(x)}{(x - a)^\theta}\, dx, \qquad I_2 = \int_{a+\epsilon}^{b} \frac{g(x)}{(x - a)^\theta}\, dx.$

The range of integration in $I_2$ is now such that the integrand has derivatives
of high order there. Thus, $I_2$ can be approximated by many of the standard
procedures previously described, and in principle the error in this
approximation can be estimated. It remains to approximate $I_1$ to within a known
error.

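By Taylor’s theorem we have for $x > a$:

$g(x) = g(a) + (x - a)\, g'(a) + \frac{(x - a)^2}{2!}\, g''(a) + \cdots + \frac{(x - a)^s}{s!}\, g^{(s)}(a) + \frac{(x - a)^{s+1}}{(s + 1)!}\, g^{(s+1)}(\xi(x)).$

Use this expansion in $I_1$ and perform the indicated integrations to find

(4)    $I_1 = \epsilon^{1-\theta} \left[ \frac{g(a)}{1 - \theta} + \frac{\epsilon}{1!}\, \frac{g'(a)}{2 - \theta} + \frac{\epsilon^2}{2!}\, \frac{g''(a)}{3 - \theta} + \cdots + \frac{\epsilon^s}{s!}\, \frac{g^{(s)}(a)}{s + 1 - \theta} \right]$

       $\qquad + \frac{1}{(s + 1)!} \int_a^{a+\epsilon} (x - a)^{s+1-\theta}\, g^{(s+1)}(\xi(x))\, dx.$

If the first (bracketed) term on the right in (4) is used as the approximation
to $I_1$ we obtain the error bound

(5)    $|E^{(1)}| \le \frac{\epsilon^{s+2-\theta}}{(s + 1)!\, (s + 2 - \theta)}\, \max_{a \le x \le a+\epsilon} |g^{(s+1)}(x)|.$

For fixed $s$ this bound is clearly an increasing function of $\epsilon$. Or for fixed
$\epsilon < 1$, if the derivatives of $g^{(s)}(x)$ do not grow too fast with $s$ it will also
be a decreasing function of $s$.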

If the error in evaluating $I_2$ is $E^{(2)}$ we must now determine conditions
on $\epsilon$, $s$ and the quadrature formula such that $|E^{(1)}| + |E^{(2)}| \le \delta$, where $\delta$
is the maximum permissible error in approximating $I$. Of course, the
parameters should be chosen such that $|E^{(1)}| \approx |E^{(2)}|$ since then some
cancellation of error may take place. For definiteness, let us assume that
$I_2$ is approximated by a composite rule using $m$ subintervals and a closed
$(n + 1)$-point Newton-Cotes quadrature formula with equal spacing over
each subinterval. Furthermore, we will assume $n$ to be even. Then from
(5.9) we must have

$E^{(2)} = \frac{m M_n}{(n + 2)!}\, h^{n+3} \left[ \frac{d^{n+2}}{dx^{n+2}}\, \frac{g(x)}{(x - a)^\theta} \right]_{x = \xi}, \qquad a + \epsilon < \xi < b;$

where $h = (b - a - \epsilon)/(mn)$. From this expression we obtain the bound (6),
in which the coefficient of the derivative term is independent of $\epsilon$. If the
derivatives entering into (5) and (6) can be estimated, say

$|g^{(s+1)}(\xi)| \le M^{(s+1)},$

and

$\max_{a+\epsilon \le x \le b} \left| \frac{d^{n+2}}{dx^{n+2}}\, \frac{g(x)}{(x - a)^\theta} \right| = N^{(n+2)}(\epsilon),$

then for fixed $s$, $n$, and $\delta$ we can find $\epsilon$ and $m$ such that the bound (5)
with $M^{(s+1)}$, namely

$\frac{\epsilon^{s+2-\theta}}{(s + 1)!\, (s + 2 - \theta)}\, M^{(s+1)},$

and the bound (6) with $N^{(n+2)}(\epsilon)$ together do not exceed $\delta$.

The bound $N^{(n+2)}(\epsilon)$ will, in general, become large for small $\epsilon$ and so
may imply an unusually large $m$. Thus, we consider an alternative procedure
obtained by modifying these considerations.


Method II. Let us rewrite the Taylor expansion of $g(x)$ as

(7a)    $g(x) = G_s(x) + \frac{(x - a)^{s+1}}{(s + 1)!}\, g^{(s+1)}(\xi(x)),$

where

(7b)    $G_s(x) \equiv g(a) + (x - a)\, g'(a) + \cdots + \frac{(x - a)^s}{s!}\, g^{(s)}(a).$

Now the integral (2) is identically represented by

(8)    $I = \int_a^b \frac{g(x) - G_s(x)}{(x - a)^\theta}\, dx + \int_a^b \frac{G_s(x)}{(x - a)^\theta}\, dx \equiv I_N + I_E.$

The second integral, $I_E$, can be evaluated explicitly, just as was the first
part of $I_1$ in (4), and we have

(9)    $I_E = (b - a)^{1-\theta} \left[ \frac{g(a)}{1 - \theta} + \frac{(b - a)}{1!}\, \frac{g'(a)}{2 - \theta} + \cdots + \frac{(b - a)^s}{s!}\, \frac{g^{(s)}(a)}{s + 1 - \theta} \right].$

However, the first integral in (8), $I_N$, no longer has a singular integrand
at $x = a$ so it can be approximated by many of the standard quadrature
formulae. In fact, the first $s$ derivatives of

$\frac{g(x) - G_s(x)}{(x - a)^\theta}$

are finite at $x = a$ and so the error in any closed formula with $n + 2 \le s$
is bounded. For example, in a composite formula employing Simpson’s
rule to approximate $I_N$ the error becomes, from (5.9) with $n = 2$:

$-\frac{m}{90}\, h^5 \left[ \frac{d^4}{dx^4}\, \frac{g(x) - G_s(x)}{(x - a)^\theta} \right]_{x = \xi},$

where

$h = \frac{b - a}{2m}$, and $a < \xi < b$.

In order that the indicated derivative remain bounded as $\xi \to a$, it is
sufficient that $s = 4$. Of course, if $0 \le s < 4$ the quadrature formula will
still yield an accurate approximation (see Theorem 5.1) but the above
form for the error cannot be used.
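
To make Method II concrete, here is a small Python sketch for the particular integral $\int_0^1 \cos x / \sqrt{x}\, dx$ (so $a = 0$, $b = 1$, $\theta = 1/2$, $g = \cos$) with $s = 4$. The choice of integrand, the helper names and the series used as a reference value are all assumptions of this example rather than part of the text.

```python
import math

def simpson(f, a, b, m):
    """Composite Simpson's rule with m double intervals."""
    h = (b - a) / (2 * m)
    total = f(a) + f(b)
    total += 2 * sum(f(a + 2 * j * h) for j in range(1, m))
    total += 4 * sum(f(a + (2 * j - 1) * h) for j in range(1, m + 1))
    return h / 3 * total

def method_two(theta=0.5, m=20):
    """Estimate the integral of cos(x) / x**theta over [0, 1] as I_N + I_E."""
    derivs = [1.0, 0.0, -1.0, 0.0, 1.0]     # g^(k)(0) for g = cos, k = 0..4

    def taylor(x):                           # G_4(x), the Taylor polynomial of g
        return sum(d * x ** k / math.factorial(k) for k, d in enumerate(derivs))

    # I_E: exact integral of G_4(x) / x**theta over [0, 1] (b - a = 1).
    i_e = sum(d / (math.factorial(k) * (k + 1 - theta))
              for k, d in enumerate(derivs))

    def regular(x):                          # (g - G_4) / x**theta, limit 0 at x = 0
        return 0.0 if x == 0.0 else (math.cos(x) - taylor(x)) / x ** theta

    return simpson(regular, 0.0, 1.0, m) + i_e

if __name__ == "__main__":
    # Reference: termwise integration of the cosine series divided by sqrt(x).
    ref = sum((-1) ** k / (math.factorial(2 * k) * (2 * k + 0.5)) for k in range(12))
    print(method_two(), ref)
```

Only the small, smooth remainder is integrated numerically; the singular part is handled in closed form, which is why a modest number of Simpson panels suffices.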

Method III. As a third alternative, which is restricted to singular
integrands of the form (2), we consider the change of variable

$(x - a) = t^\phi, \qquad dx = \phi\, t^{\phi - 1}\, dt.$

Then (2) becomes, if $\phi = k/(1 - \theta)$ for $k$ any positive integer,

(10)    $I = \phi \int_0^{(b - a)^{1/\phi}} g(a + t^\phi)\, t^{k-1}\, dt.$

Now, since $k \ge 1$, the integrand of the above integral is continuous at
$t = 0$. Thus numerical quadrature formulae may be directly applied to
(10). In fact, if $\theta = q/p$, i.e., rational, and $k = p - q$, the integrand is
smooth as long as $g$ is smooth.
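
A short Python sketch of Method III for rational $\theta = q/p$ follows. With $k = p - q$ the exponent is $\phi = k/(1 - \theta) = p$, and the factor $\phi$ from $dx = \phi\, t^{\phi - 1}\, dt$ multiplies the transformed integral. The function names and the test integral $\int_0^1 \cos x / \sqrt{x}\, dx$ are assumptions of this example.

```python
import math

def simpson(f, a, b, m):
    """Composite Simpson's rule with m double intervals."""
    h = (b - a) / (2 * m)
    total = f(a) + f(b)
    total += 2 * sum(f(a + 2 * j * h) for j in range(1, m))
    total += 4 * sum(f(a + (2 * j - 1) * h) for j in range(1, m + 1))
    return h / 3 * total

def method_three(g, a, b, p, q, m):
    """Integral of g(x) / (x - a)**(q/p) over [a, b], 0 < q < p,
    after the substitution x - a = t**p."""
    phi = p                 # phi = k / (1 - theta) with k = p - q, theta = q/p
    k = p - q
    upper = (b - a) ** (1.0 / phi)
    return phi * simpson(lambda t: g(a + t ** phi) * t ** (k - 1), 0.0, upper, m)

if __name__ == "__main__":
    # theta = 1/2: the transformed integral is 2 * integral of cos(t^2) over [0, 1].
    ref = sum((-1) ** j / (math.factorial(2 * j) * (2 * j + 0.5)) for j in range(12))
    print(method_three(math.cos, 0.0, 1.0, 2, 1, 50), ref)
```

After the substitution the integrand is as smooth as $g$, so the usual $h^4$ behaviour of Simpson's rule is recovered without any special treatment of the endpoint.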

Methods I and II are applicable to other types of singularities. In fact,
if the integral is of the form

(11)    $\int_a^b g(x)\, S(x)\, dx$

where $S(a)$ is infinite, Methods I and II may be applied if integrals of the
form

(12)    $\int (x - a)^k S(x)\, dx, \qquad k = 0, 1, \ldots,$

can be explicitly evaluated. For example, if $S(x) = \ln(x - a)$, we can
employ these methods.

Method IV. In many cases of interest the singular part of the integrand,
i.e., $S(x)$ in (11), is of one sign throughout $[a, b]$. Then, in principle, the
weighted quadrature methods outlined in Section 4 can be employed. In
particular, the weighted Gaussian quadrature formulae are frequently
very effective for evaluating such singular integrals. Of course, such
applications require the determination of the polynomials orthogonal over
$[a, b]$ with respect to the weight $S(x)$. For many special forms of $S(x)$
these polynomials are known; for instance, in the case

(13)    $S(x) = (x - a)^{-1/2} (b - x)^{-1/2},$

the Gauss-Chebyshev formula derived in Section 4.1 is the relevant
scheme. However, even if the required polynomials are not well known,
it may be of value to construct them and to devise the appropriate
quadrature formula. This is especially true if many integrals containing the same
singular part are to be evaluated.

6.3. Infinite Integration Limits 

It is clear that an integral of the form

(14)    $I = \int_a^\infty g(x)\, dx$

cannot, in general, be accurately approximated by the standard quadrature
methods (which employ a finite number of finite subintervals). The usual
approach to such problems is to write again $I = I_1 + I_2$ where now

(15)    $I_1 \equiv \int_a^b g(x)\, dx, \qquad I_2 \equiv \int_b^\infty g(x)\, dx.$

Then if $b$ is “sufficiently large,” it may be possible by analytical means
to show that $I_2$ is negligible. Or alternatively, $g(x)$ may be approximated
for $x > b$ by some function from which $I_2$ is then approximated; in this
case good error estimates are usually difficult to obtain. Another procedure,
too frequently disregarded, is to reduce $I_2$ to an integral over a finite
interval.

That is, introduce the change of variable $x = 1/\xi$ and obtain

(16)    $I_2 = \int_0^{1/b} f(\xi)\, d\xi.$

Here we have introduced the function

(17)    $f(\xi) \equiv \frac{1}{\xi^2}\, g\!\left(\frac{1}{\xi}\right).$

Now if $f(\xi)$ is not singular at $\xi = 0$, then $I_2$ may be evaluated, in the form
(16), by standard quadrature methods. If $f(\xi)$ is singular at $\xi = 0$, then
the evaluation of $I_2$ might be reduced to the previous case of Subsection
6.2.

In fact, a sufficient condition for $I_2$ defined in (15) to converge (absolutely)
is that $g(x)$ be continuous and that

(18)    $\lim_{x \to \infty} x^{1+\epsilon} g(x) = 0$ for some $\epsilon > 0$.

This, by (17), is equivalent to

$\lim_{\xi \to 0} \xi^{1-\epsilon} f(\xi) = 0.$

Now this condition will be satisfied if $f(\xi)$ behaves at $\xi = 0$ like $\xi^{-\theta}$
where $\theta < 1 - \epsilon$. If $\epsilon < 1$, the integral in (16) may have a singularity
of the form indicated in (2).
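
The reduction of the tail integral to a finite interval can be tried out with a few lines of Python. The sketch below assumes $b > 0$ and uses the composite midpoint rule, which is a choice made here so that the transformed integrand is never evaluated at $\xi = 0$; the names `midpoint` and `tail_integral` and the test function $g(x) = 1/(1 + x^2)$ are likewise assumptions of this example.

```python
import math

def midpoint(f, a, b, m):
    """Composite midpoint rule with m subintervals; never evaluates f at a or b."""
    h = (b - a) / m
    return h * sum(f(a + (j + 0.5) * h) for j in range(m))

def tail_integral(g, b, m):
    """Integral of g over [b, infinity), b > 0, via the substitution x = 1/xi."""
    def f(xi):
        # Transformed integrand: g(1/xi) / xi^2 on 0 < xi <= 1/b.
        return g(1.0 / xi) / (xi * xi)
    return midpoint(f, 0.0, 1.0 / b, m)

if __name__ == "__main__":
    # For g(x) = 1 / (1 + x^2) and b = 1 the tail equals pi / 4.
    print(tail_integral(lambda x: 1.0 / (1.0 + x * x), 1.0, 1000), math.pi / 4)
```

For this $g$ the transformed integrand is $1/(1 + \xi^2)$, which is smooth at $\xi = 0$, so a standard rule on $[0, 1/b]$ converges at its usual rate.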

Finally, we point out that special weighted Gaussian quadrature formulae
may be effective for various integrals over infinite intervals. In particular,
the orthogonal polynomials over $[-\infty, \infty]$ with respect to the weight
function $e^{-x^2}$ are well known. They are called Hermite polynomials, $H_n(x)$,
and they can be shown to be given by (see Problem 3.18 in Chapter 5)

(19a)    $H_n(x) = (-1)^n e^{x^2} \frac{d^n}{dx^n}\, (e^{-x^2}).$

It is not difficult to deduce that

(19b)    $H_{n+1}(x) = 2x H_n(x) - \frac{d}{dx} H_n(x),$

and hence by induction, since $H_0(x) \equiv 1$, that

(19c)    $H_n(x) = 2^n x^n + \cdots.$

By repeatedly using integration by parts we can show that the normalized
Hermite polynomials are

(19d)    $\frac{H_n(x)}{\sqrt{2^n n!\, \pi^{1/2}}}.$

The formulae based on the $H_n(x)$ are called Gauss-Hermite quadrature
formulae and are used to approximate integrals of the general form

$\int_{-\infty}^{\infty} e^{-x^2} g(x)\, dx.$

For integrals over $[0, \infty]$, the Laguerre polynomials, $L_n(x)$, defined as

(20a)    $L_n(x) = (-1)^n e^{x} \frac{d^n}{dx^n}\, (x^n e^{-x}),$

are sometimes useful. They are orthogonal over $[0, \infty]$ with respect to the
weight $e^{-x}$. It can be shown that

(20b) L n (x) = 

and that the normalized Laguerre polynomials are $(1/n!)\, L_n(x)$. The
Gauss-Laguerre quadrature formulae are based on these polynomials and
are used to approximate integrals of the form

$\int_0^\infty e^{-x} g(x)\, dx.$
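
A Gauss-Hermite rule can be assembled from the ingredients above. The sketch below is illustrative only and leans on three standard facts that are not stated in this section and should be read as assumptions: the recurrence $H_{k+1}(x) = 2x H_k(x) - 2k H_{k-1}(x)$, the bound $|x| < \sqrt{2k + 1}$ on the zeros of $H_k$, and the weight formula $2^{n-1} n! \sqrt{\pi} / (n H_{n-1}(x_j))^2$, which is what (5b) of Section 4 gives for this weight. The zeros are located by bisection, using the interlacing of the zeros of $H_k$ and $H_{k-1}$; all function names are hypothetical.

```python
import math

def hermite_pair(n, x):
    """Return (H_n(x), H_{n-1}(x)) for n >= 1 via the three-term recurrence."""
    prev, cur = 1.0, 2.0 * x
    for k in range(1, n):
        prev, cur = cur, 2.0 * x * cur - 2.0 * k * prev
    return cur, prev

def hermite_roots(n):
    """Zeros of H_n, found by bisection between the zeros of H_{n-1}."""
    roots = [0.0]                                  # the zero of H_1
    for k in range(2, n + 1):
        bound = math.sqrt(2 * k + 1)               # zeros of H_k lie inside (-bound, bound)
        edges = [-bound] + roots + [bound]
        new_roots = []
        for lo, hi in zip(edges, edges[1:]):
            f_lo = hermite_pair(k, lo)[0]
            for _ in range(100):
                mid = 0.5 * (lo + hi)
                f_mid = hermite_pair(k, mid)[0]
                if f_lo * f_mid <= 0.0:
                    hi = mid
                else:
                    lo, f_lo = mid, f_mid
            new_roots.append(0.5 * (lo + hi))
        roots = new_roots
    return roots

def gauss_hermite(g, n):
    """n-point estimate of the integral of exp(-x^2) g(x) over the real line."""
    total = 0.0
    for x in hermite_roots(n):
        h_prev = hermite_pair(n, x)[1]
        weight = (2 ** (n - 1) * math.factorial(n) * math.sqrt(math.pi)
                  / (n * h_prev) ** 2)
        total += weight * g(x)
    return total

if __name__ == "__main__":
    # Degree of precision 2n - 1 = 9 for n = 5, so x^8 should be integrated
    # exactly; the exact value is Gamma(9/2) = (105/16) sqrt(pi).
    print(gauss_hermite(lambda x: x ** 8, 5), 105.0 / 16.0 * math.sqrt(math.pi))
```

Bisection is slower than Newton's method but needs no starting guesses, which keeps the sketch short; a production code would more likely use a library routine for the nodes and weights.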


PROBLEMS, SECTION 6 

1. Evaluate

$\int_0 \sqrt{1 - x^2}\, dx$

by Method II, using the composite Simpson’s rule, and obtain
four-decimal-place accuracy. What is the largest interval $h$ that is permissible?

2. Use Method III and the composite Simpson’s rule to evaluate

$\int_0 \sqrt{1 - x^2}\, dx$

correctly to four decimal places. What is the largest interval $h$ that is
permissible?

3. Substitute the new variable $x = \cos\theta$ and use the composite Simpson’s
rule to evaluate

$\int_0 \sqrt{1 - x^2}\, dx$

correctly to four decimal places. What is the largest permissible interval
$h = \Delta\theta$?

4. Verify the properties of the Hermite and Laguerre polynomials given in 
the text [see equations (19) and (20)]. 


7. MULTIPLE INTEGRALS 

The problem of efficiently approximating multiple integrals numerically
has not been completely solved. An obvious source of complexity is the
variety of domains of integration in higher dimensions compared to just
intervals in our study of one dimensional integrals. However, even if the
domain is restricted, say to the unit cube, then the resulting problem is
still not in a satisfactory state. A fundamental difficulty is essentially the
great degree of freedom in locating the nodes or equivalently in the large
number of, say, uniformly spaced nodes required to get reasonable
accuracy.

One of the basic methods for approximating multiple integrals is, as
in the one dimensional case, to integrate a polynomial approximation
of the integrand. But since interpolation theory in higher dimensions is
not well developed, we again have difficulty in devising practical schemes.
Generalizations of the method of undetermined coefficients offer many
possibilities, but only a few of these have been exploited for multiple
integrals. Finally, we point out that the difficulties increase as the dimension
of the domain of integration increases. This seems related to the
fact that the ratio of the surface area to the volume, for an $n$-dimensional
unit cube, increases with $n$. We shall consider numerical methods for
evaluating double integrals for the most part. Many of the procedures
extend in an obvious way to higher dimensions, with perhaps a subsequent
loss in efficiency or accuracy. Approximation methods for double integrals
are frequently called cubature formulae since they approximate the volume
associated with the integrand.

In general, the problem is to approximate an integral of the form 

(1) J{f } = f f{x) dx, 

JD 

where $\mathbf{x} = (x_1, \ldots, x_p)$ and $d\mathbf{x} = dx_1 \cdots dx_p$ are a point and a volume 
element in the p-dimensional space, respectively, and D is a domain in 
this space. The approximations considered are all to be of the form 

(2)    $J_N\{f\} = \sum_{\nu=1}^{N} A_\nu f(\mathbf{x}_\nu).$

Here the N points, $\mathbf{x}_\nu$, are the nodes of the formula and the $A_\nu$ are the 
coefficients. We say that formula (2) has degree of precision m as an 
approximation to the integral (1) if $J\{P(\mathbf{x})\} = J_N\{P(\mathbf{x})\}$ for all polynomials, 
$P(\mathbf{x})$, in $\mathbf{x}$ of degree† at most m but not for some polynomial 
of degree m + 1. 

We cannot proceed as in Section 1 to study general interpolatory 
schemes since the general interpolation problem is not well posed in 
higher dimensions. However, if the nodes are specially chosen, say as in 
Section 6 of Chapter 6, then interpolation can be used and we consider 
such cases first. 

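† We say $P(\mathbf{x})$ is of degree at most m in $\mathbf{x}$, if $P(\mathbf{x})$ is a polynomial in $(x_1, \ldots, x_p)$ 
of the form 

$P(\mathbf{x}) = \sum_{0 \le i_1 + i_2 + \cdots + i_p \le m} c_{i_1 i_2 \cdots i_p}\, x_1^{i_1} x_2^{i_2} \cdots x_p^{i_p}.$
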
7.1. The Use of Interpolation Polynomials 

Let the integral in (1) be over a plane domain, say 

(3)    $J\{f\} = \iint_D f(x, y)\,dx\,dy.$

Let us pick as nodes the N = (m + 1)(n + 1) distinct points: $(x_i, y_j)$, 
i = 0, 1, ..., m; j = 0, 1, ..., n; where the m + 1 distinct numbers $\{x_i\}$ 
and n + 1 distinct numbers $\{y_j\}$ are, at present, arbitrary. Then a polynomial 
P(x, y), of degree m in x and n in y, which is equal to f(x, y) at 
these N nodes is given by (6.3) in Chapter 6. We use this polynomial to 
define the cubature formula 

(4a)    $J_N\{f\} = \iint_D P(x, y)\,dx\,dy = \sum_{i=0}^{m} \sum_{j=0}^{n} A_{ij} f(x_i, y_j).$

Here we have introduced the coefficients, $A_{ij}$, by the definitions 

(4b)    $A_{ij} = \iint_D X_{m,i}(x)\, Y_{n,j}(y)\,dx\,dy, \qquad i = 0, 1, \ldots, m, \quad j = 0, 1, \ldots, n,$

and the Lagrange interpolation coefficients $X_{m,i}(x)$ and $Y_{n,j}(y)$ are defined 
in (6.2) of Chapter 6. 

While this procedure is formally valid for very general domains, D, 
it is only practical when the integrals in (4b) can be evaluated explicitly. 
A particularly simple and important special case is that of a rectangular 
domain, 

$D$: $\{x, y \mid a \le x \le b;\ c \le y \le d\}.$

In this case we have 

(5)    $A_{ij} = \alpha_i \beta_j; \qquad \alpha_i = \int_a^b X_{m,i}(x)\,dx, \qquad \beta_j = \int_c^d Y_{n,j}(y)\,dy,$

and the quantities $\alpha_i$ and $\beta_j$ are just the coefficients for appropriate one 
dimensional interpolatory quadrature formulae. Furthermore, if the numbers 
$x_i$ are equally spaced in [a, b], and the $y_j$ are equally spaced in [c, d], 
then the $\alpha_i$ and $\beta_j$ are the coefficients in the (m + 1)-point and (n + 1)-point 
Newton-Cotes quadrature formulae respectively (see Problem 1). 
The error in the cubature formula (4) as an approximation to the integral 
(3) is, upon recalling the expression (6.7c) of Chapter 6 for the error in the 
interpolation polynomial, 

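(6)    $E_N\{f\} = \iint_D [f(x, y) - P(x, y)]\,dx\,dy = \iint_D R(x, y)\,dx\,dy$

       $= \iint_D \left[ \frac{\omega_m(x)}{(m + 1)!} \frac{\partial^{m+1} f(\xi(x), y)}{\partial x^{m+1}} + \frac{\omega_n(y)}{(n + 1)!} \frac{\partial^{n+1} f(x, \eta(y))}{\partial y^{n+1}} - \frac{\omega_m(x)\,\omega_n(y)}{(m + 1)!\,(n + 1)!} \left( \frac{\partial}{\partial x} \right)^{m+1} \left( \frac{\partial}{\partial y} \right)^{n+1} f(\xi'(x), \eta'(y)) \right] dx\,dy.$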

We deduce from this that the formula (4) has degree of precision at least 
min (m, n). For instance, if as is frequently the case, we take m = n, 
then by using $N = (n + 1)^2$ nodes, a formula with degree of precision 
at least n is obtained. 
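
The product construction (4a), (5) can be illustrated with a short sketch. It assumes the rectangle case with m = n = 2 and equally spaced nodes, so that the $\alpha_i$ and $\beta_j$ are the three-point Newton-Cotes (Simpson) coefficients; the function names are hypothetical, and exact rational arithmetic is used only so that exactness can be tested without roundoff.

```python
from fractions import Fraction


def simpson_rule(a, b):
    """Nodes and coefficients of the 3-point Newton-Cotes (Simpson) rule on [a, b]."""
    a, b = Fraction(a), Fraction(b)
    h = (b - a) / 2
    return [a, a + h, b], [h / 3, 4 * h / 3, h / 3]


def product_rule(f, a, b, c, d):
    """J_N{f} = sum_i sum_j alpha_i * beta_j * f(x_i, y_j) on [a, b] x [c, d]."""
    xs, alphas = simpson_rule(a, b)
    ys, betas = simpson_rule(c, d)
    return sum(al * be * f(x, y)
               for x, al in zip(xs, alphas)
               for y, be in zip(ys, betas))


def exact_monomial(r, q, a, b, c, d):
    """Exact integral of x^r * y^q over the rectangle [a, b] x [c, d]."""
    a, b, c, d = map(Fraction, (a, b, c, d))
    return ((b ** (r + 1) - a ** (r + 1)) / (r + 1)) * ((d ** (q + 1) - c ** (q + 1)) / (q + 1))


# Check the guaranteed degree of precision min(m, n) = 2 on [0, 2] x [1, 3].
for r in range(3):
    for q in range(3 - r):
        f = lambda x, y, r=r, q=q: x ** r * y ** q
        assert product_rule(f, 0, 2, 1, 3) == exact_monomial(r, q, 0, 2, 1, 3)
```

The text only guarantees degree of precision at least min (m, n); Simpson's rule happens to integrate cubics exactly as well, so this particular formula does somewhat better than the guaranteed bound.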

However, a formula using only $M = \tfrac{1}{2}(n + 1)(n + 2)$ nodes can be 
devised which also has degree of precision at least n. For this purpose, 
we integrate the interpolation polynomial $P_n(x, y)$, given by (6.10) of 
Chapter 6, over D. The general result is somewhat cumbersome to write 
down in the form (2). First, divided differences of the type 
$f[x_0, \ldots, x_k; y_0, \ldots, y_j]$ must be expanded as linear combinations of the function 
values, $f(x_\nu, y_\mu)$, and then all terms containing such function values 
must be combined to determine the coefficients, $B_{\nu\mu}$. For small values of n, 
say $n \le 3$, this is easily done (see Problem 2). However, for equally 
spaced $x_k$ and $y_j$ difference operators may be employed to simplify the 
notation and even the calculations. We indicate the general formula 
obtained in this manner as 

(7)    $K_M\{f\} = \iint_D P_n(x, y)\,dx\,dy$

       $= \sum_{k=0}^{n} \sum_{j=0}^{n-k} f[x_0, \ldots, x_k; y_0, \ldots, y_j] \iint_D \omega_{k-1}(x)\,\omega_{j-1}(y)\,dx\,dy$

       $= \sum_{\nu=0}^{n} \sum_{\mu=0}^{n-\nu} B_{\nu\mu} f(x_\nu, y_\mu).$

The error in this formula is easily obtained from (6.11) of Chapter 6. 
We only wish to observe that it implies that (7) has degree of precision 
at least n. 

The nodes that enter into this formula are a subset of the rectangular 
array of $(n + 1)^2$ points $(x_i, y_j)$, i, j = 0, 1, ..., n. If the numbers $x_i$ and 
$y_j$ are monotonically ordered, say $x_0 < x_1 < x_2 < \cdots$, then the nodes in 
(7) are those on and below the main diagonal of a schematic n + 1 by 
n + 1 matrix of dots (see Figure 1a). However, any other selection of 
points obtained by permuting rows or columns could also be employed. 
This just corresponds to a renumbering of the $x_i$ and $y_j$, say as in 
Figure 1b. 

Both of the interpolatory cubature formulae (4) and (7) are easily 
extended to integrals in higher dimensions. As the dimension increases, 
there is a greater saving in number of nodes in extensions of $K_M\{f\}$ 
formulae compared to $J_N\{f\}$ formulae while maintaining precision of 
degree at least n. Thus, in the plane, $J_N\{f\}$ requires $N = (n + 1)^2$ nodes and 
$K_M\{f\}$ requires $M = \tfrac{1}{2}(n + 1)(n + 2)$ nodes for each to have degree of 
precision at least n. The ratio of the number of nodes required is 

$\frac{M}{N} = \frac{n + 2}{2(n + 1)} \approx \frac{1}{2},$  for large n. 

In three dimensions the ratio becomes 

$\frac{M}{N} = \frac{(n + 1)(n^2 + 5n + 6)/6}{(n + 1)^3} \approx \frac{1}{6},$  for large n. 
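
The two node counts above can be tabulated with a few lines of code. The sketch below assumes that the p-dimensional extension of $K_M\{f\}$ uses one node per monomial of degree at most n, that is, the binomial coefficient C(n + p, p); this agrees with the counts stated for p = 2 and p = 3 but is an extrapolation for larger p. The function names are hypothetical.

```python
from math import comb


def nodes_product(n, p):
    """Nodes used by the extension of J_N: the full (n + 1)^p array."""
    return (n + 1) ** p


def nodes_triangular(n, p):
    """Nodes assumed for the extension of K_M: monomials of degree <= n in p variables."""
    return comb(n + p, p)


def node_ratio(n, p):
    """The ratio M / N for degree of precision at least n in p dimensions."""
    return nodes_triangular(n, p) / nodes_product(n, p)


# p = 2 and p = 3 reproduce the expressions in the text.
n = 7
assert nodes_triangular(n, 2) == (n + 1) * (n + 2) // 2
assert nodes_triangular(n, 3) == (n + 1) * (n * n + 5 * n + 6) // 6
```

Under this assumption the ratio tends to 1/p! for large n, which matches the limits 1/2 and 1/6 given above.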


7.2. Undetermined Coefficients (and Nodes) 

The general formula (2) can be written for double integrals as 

(8)    $J_N\{f\} = \sum_{\nu=1}^{N} A_\nu f(x_\nu, y_\nu).$

We note that there are 3N parameters which determine such a scheme: 
the N coefficients, $A_\nu$, and the 2N coordinates of the nodes, $(x_\nu, y_\nu)$. As an 
approximation to the integral (3) the cubature formula (8) will have degree 
of precision at least n if for non-negative integers r and q: 


(9)    $\sum_{\nu=1}^{N} A_\nu x_\nu^r y_\nu^q = \iint_D x^r y^q\,dx\,dy, \qquad r + q = 0, 1, \ldots, n.$


There are $\tfrac{1}{2}(n + 1)(n + 2)$ conditions imposed in (9). Hence, there are at 
least as many free parameters in (8) as there are conditions in (9) if 

(10a)    $N \ge \frac{(n + 1)(n + 2)}{6} \approx \frac{n^2}{6}$

or 

(10b)    $n \le \tfrac{1}{2}\left( \sqrt{1 + 24N} - 3 \right) \approx \sqrt{6N}.$

We note from (10a) that the number of nodes for which a degree of 
precision n might possibly be obtained is about 1/3 the number used in the 
cubature formula (7) and about 1/6 the number used in (4). 

This procedure is practical only if the integrals on the right-hand side 
of the equations in (9) can be evaluated explicitly (or perhaps if they can 
be accurately approximated with ease). This is, of course, the case if D 
is a rectangle or, in fact, any polygonal domain. However, the resulting 
system of $\tfrac{1}{2}(n + 1)(n + 2)$ non-linear equations in 3N unknowns must also 
have a real solution with nodes in D. There are many special cases in 
which the procedure can be employed successfully. Let us consider, for 
example, the simple case of one node, N = 1. Then from (10) we find 
that n = 1 and there are only three equations in (9), namely, 

$A_1 = \iint_D dx\,dy, \qquad A_1 x_1 = \iint_D x\,dx\,dy, \qquad A_1 y_1 = \iint_D y\,dx\,dy.$

Thus we find that the coefficient, $A_1$, is the area of the domain D and that 
the node, $(x_1, y_1)$, is at the centroid of D. The resulting cubature formula 

$J_1\{f\} = A_1 f(x_1, y_1)$

is exact for all linear integrands in (3). This derivation and formula trivially 
generalize to any number of dimensions. 
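
As a small illustration of the one-node formula, the sketch below applies it to a polygonal domain, where the area and centroid are available in closed form through the standard shoelace formulas. The choice of a polygon, the counter-clockwise vertex ordering, and the function names are assumptions made for the example.

```python
def polygon_area_centroid(vertices):
    """Area and centroid of a simple polygon with counter-clockwise vertices (shoelace formulas)."""
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # Centroid = (1 / (6 * area)) * sum = (1 / (3 * twice_area)) * sum.
    return twice_area / 2.0, (cx / (3.0 * twice_area), cy / (3.0 * twice_area))


def one_point_rule(f, vertices):
    """J_1{f} = A_1 * f(x_1, y_1), with A_1 the area and (x_1, y_1) the centroid."""
    area, (xc, yc) = polygon_area_centroid(vertices)
    return area * f(xc, yc)


# Triangle with vertices (0,0), (1,0), (0,1): the integral of 2 + 3x - y is 4/3.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
approx = one_point_rule(lambda x, y: 2.0 + 3.0 * x - y, triangle)
assert abs(approx - 4.0 / 3.0) < 1e-12
```

For a quadratic integrand such as $x^2$ the same rule is no longer exact, in agreement with the degree of precision n = 1.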

The next simplest case of only two nodes, N = 2, yields n = 2 by (10), 
and hence a system of six non-linear algebraic equations of the form (9) 
must be solved. However, it is easy to show that this system does not 
always have a solution (see Problem 5). Thus we cannot, in general, 
determine a two point cubature formula with degree of precision two. 
For domains which have symmetry about the x- and y-axes, the analysis 
of the system (9) can be simplified if the nodes are required to be symmetrically 
placed and have equal coefficients at corresponding locations. 
In this way various cubature schemes for integration over rectangles and 
circles may easily be derived. 

When a formula of the form (8) satisfies the conditions (9), and hence 
has degree of precision n or more, an expression for the error can be 
derived by analogy with the proof of Theorem 0.1. For this purpose we 
require that the integrand function, f(x, y), have continuous partial 
derivatives of all orders up to at least the (n + 1)st. Then we can expand 
the integrand about some point $(x_0, y_0)$ into a finite Taylor’s series with 
remainder in the form 

$f(x, y) = T_n(x, y) + R_n(x, y).$

Here $T_n(x, y)$ is a polynomial of degree at most n and $R_n(x, y)$ is the 
known remainder which can be written symbolically as 

$R_n(x, y) = \frac{1}{(n + 1)!} \left[ (x - x_0) \frac{\partial}{\partial x} + (y - y_0) \frac{\partial}{\partial y} \right]^{n+1} f(\xi, \eta).$

The error in the cubature formula is then 

$E_N\{f\} = J\{f\} - J_N\{f\}$
$\quad = J\{T_n\} + J\{R_n\} - J_N\{T_n\} - J_N\{R_n\}$
$\quad = J\{R_n\} - J_N\{R_n\}.$

Here we have used the fact that since the degree of precision is n, 
$J\{T_n\} = J_N\{T_n\}$. In somewhat expanded form this error expression is 

(11)    $E_N\{f\} = \frac{1}{(n + 1)!} \left\{ \iint_D \left[ (x - x_0) \frac{\partial}{\partial x} + (y - y_0) \frac{\partial}{\partial y} \right]^{n+1} f(\xi, \eta)\,dx\,dy - \sum_{\nu=1}^{N} A_\nu \left[ (x_\nu - x_0) \frac{\partial}{\partial x} + (y_\nu - y_0) \frac{\partial}{\partial y} \right]^{n+1} f(\xi_\nu, \eta_\nu) \right\}.$

The integrand in (11) is, of course, just symbolic since $(\xi, \eta)$ depends upon 
(x, y) for purposes of the integration but not for the differentiations. 
Note that if the maximum distance from $(x_0, y_0)$ to any point in D or any 
node $(x_\nu, y_\nu)$ is h then the error satisfies $E_N\{f\} = O(h^{n+1})$. In particular, 
if the coefficients $A_\nu$ are all non-negative, so that 

$\sum_{\nu=1}^{N} |A_\nu| = \sum_{\nu=1}^{N} A_\nu = \iint_D dx\,dy,$

and all (n + 1)st order derivatives of f(x, y) are bounded by $M_{n+1}$, say, 
then with h as above we deduce from (11) that 

(12)    $|E_N\{f\}| \le \frac{(2h)^{n+1}}{(n + 1)!}\, 2M_{n+1} \iint_D dx\,dy.$

This estimate holds for any cubature formula which has non-negative 
coefficients and degree of precision n > 0, provided that the integrand has 
the appropriate smoothness. 

7.3. Separation of Variables 

Perhaps the most obvious way to devise approximations for multiple 
integrals is by the repeated use of one dimensional quadrature formulae. 
The domain, D , must be somewhat special, or else it must be the union of 
special subdomains, in order for us to apply this method of separation of 
variables. In two dimensions the restriction is that vertical (or horizontal) 
lines have at most one segment in common with D. Integrals of the form 
(3) can then be written as 

(13)    $J\{f\} = \int_a^b \left[ \int_{y_1(x)}^{y_2(x)} f(x, y)\,dy \right] dx,$

where the segment $y_1(x) \le y \le y_2(x)$ is in D for all x in [a, b]. If we introduce 
for each f(x, y) a function of the single variable x by the definition 

(14a)    $G(f; x) = \int_{y_1(x)}^{y_2(x)} f(x, y)\,dy, \qquad a \le x \le b,$

then the double integral (13) becomes 

(14b)    $J\{f\} = K\{G\} = \int_a^b G(f; x)\,dx.$

Now let us approximate the integral $K\{G\}$ by some n-point quadrature 
formula with coefficients $\alpha_j$ and nodes, $x_j$, which all lie in [a, b], say 

(15)    $K_n\{G\} = \sum_{j=1}^{n} \alpha_j G(f; x_j).$

The numbers $G(f; x_j)$ which are required to evaluate this formula are given 
in (14a) as single integrals and hence can be approximated by applying 
other one dimensional quadrature formulae. For ease of presentation we 
use an m-point formula for each j and write the approximations to the 
$G(f; x_j)$ as 

(16)    $G_m(f; x_j) = \sum_{k=1}^{m} \beta_{jk} f(x_j, y_{jk}), \qquad j = 1, 2, \ldots, n.$

Here the coefficients $\beta_{jk}$ and nodes $y_{jk}$ must, in general, depend upon the 
value $x_j$ since the interval of integration, $[y_1(x), y_2(x)]$ in (14a) depends 
upon x. By using (16) the value $K_n\{G\}$ of (15) finally yields the cubature 
formula 

(17)    $J_{mn}\{f\} = \sum_{j=1}^{n} \sum_{k=1}^{m} \alpha_j \beta_{jk} f(x_j, y_{jk}).$

This formula employs mn nodes and is somewhat similar to that given 
by (4a) with coefficients (5). If the domain were a rectangle, then the same 
m-point formula could reasonably be used in (16) for all j. Then in (17) 
we could replace $\beta_{jk}$ and $y_{jk}$ by $\beta_k$ and $y_k$, respectively, to get formal 
agreement with (4a) and (5). 
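
Formula (17) translates almost directly into code. The sketch below is one possible realization, assuming Gauss-Legendre formulae (from NumPy's `leggauss`) for both the outer rule (15) and the inner rules (16); the helper names and the sample domain $0 \le x \le 1$, $0 \le y \le x^2$ are illustrative choices, not taken from the text.

```python
import numpy as np


def gauss_on_interval(npts, lo, hi):
    """npts-point Gauss-Legendre nodes and coefficients mapped from [-1, 1] to [lo, hi]."""
    t, w = np.polynomial.legendre.leggauss(npts)
    half = 0.5 * (hi - lo)
    return 0.5 * (hi + lo) + half * t, half * w


def cubature_separated(f, a, b, y1, y2, n, m):
    """J_mn{f} = sum_j alpha_j * sum_k beta_jk * f(x_j, y_jk), as in (17)."""
    xs, alphas = gauss_on_interval(n, a, b)
    total = 0.0
    for xj, alpha_j in zip(xs, alphas):
        # The inner nodes y_jk and coefficients beta_jk depend on x_j
        # through the interval [y1(x_j), y2(x_j)].
        ys, betas = gauss_on_interval(m, y1(xj), y2(xj))
        total += alpha_j * sum(bk * f(xj, yk) for yk, bk in zip(ys, betas))
    return total


# Domain 0 <= x <= 1, 0 <= y <= x^2; the integral of x + y is 1/4 + 1/10 = 0.35.
value = cubature_separated(lambda x, y: x + y, 0.0, 1.0,
                           lambda x: 0.0, lambda x: x * x, n=3, m=1)
assert abs(value - 0.35) < 1e-12
```

Here a single inner node already suffices for the linear integrand, while the outer rule must integrate a quartic in x exactly, which is why n = 3 is used.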

The error in the cubature formula (17) is defined as 

$E_{mn}\{f\} = J\{f\} - J_{mn}\{f\}.$

Let us introduce for the quadrature errors in (15) and (16) the notation: 

(18)    $e_n\{G\} = K\{G\} - K_n\{G\},$
        $e_{mj}\{f\} = G(f; x_j) - G_m(f; x_j), \qquad j = 1, 2, \ldots, n.$

Then from (14b) we have 

(19)    $E_{mn}\{f\} = K\{G\} - K_n\{G\} + K_n\{G\} - J_{mn}\{f\}$

        $= e_n\{G\} + \sum_{j=1}^{n} \alpha_j [G(f; x_j) - G_m(f; x_j)]$

        $= e_n\{G\} + \sum_{j=1}^{n} \alpha_j e_{mj}\{f\}.$

It is interesting to note that when the degrees of precision of the quadrature 
formulae (15) and (16) are known we do not, in general, know the degree 
of precision of the cubature formula (17). This is because G(f; x) is not 
generally a polynomial in x when f(x, y) is a polynomial. In fact, it is 
easy to see that $J_{mn}\{f\}$ may not even be exact for constant integrands 
when the quadrature formulae employed have arbitrarily high degrees of 
precision. 

If the bounding curves $y_1(x)$ and $y_2(x)$ of D are polynomials of degree at 
most s > 0, then lower bounds can be given for the degree of precision. 
If f(x, y) is a polynomial of degree p then, by (14a), G(f; x) is a polynomial 
of degree at most s(p + 1). So if (15) has degree of precision s(p + 1) 
and (16) has degree of precision p then $J_{mn}\{f\}$ has degree of precision at 
least p. Of course, if the domain is a rectangle, i.e., s = 0, then $J_{mn}\{f\}$ 
has a degree of precision which is at least the minimum of those for (15) 
and (16). 

From this result it follows that cubature formulae of high degree of 
precision with relatively few nodes may be devised if the domain of integration 
is a rectangle. To get degree of precision n in such a case we use 
[(n + 1)/2]-point Gaussian quadrature formulae as the two relevant 
schemes for (15) and (16). It is here assumed that n is odd and the total 
number of nodes required is then only $\tfrac{1}{4}(n + 1)^2$. For large values of n 
this is about half the number of points that were required in the efficient 
interpolation quadrature scheme (7) with the same degree of precision 
(and about 3/2 the number required in Subsection 7.2 by the method of 
undetermined coefficients). However, none of the nodes in the Gaussian 
scheme can be on the boundary of the domain and hence its usefulness in 
composite cubature formulae is reduced. 
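
The node count and degree of precision of the Gaussian product scheme can be checked numerically. The following sketch assumes a rectangle [a, b] x [c, d] and uses NumPy's `leggauss` for the one dimensional Gaussian formulae; the function names are hypothetical.

```python
import numpy as np


def gauss_product_rule(n, a, b, c, d):
    """Product Gauss rule on [a, b] x [c, d] built from [(n + 1)/2]-point formulae (n odd).

    Returns a list of (x, y, coefficient) triples.
    """
    k = (n + 1) // 2
    t, w = np.polynomial.legendre.leggauss(k)
    xs = 0.5 * (a + b) + 0.5 * (b - a) * t
    ys = 0.5 * (c + d) + 0.5 * (d - c) * t
    ax = 0.5 * (b - a) * w
    by = 0.5 * (d - c) * w
    return [(x, y, al * be) for x, al in zip(xs, ax) for y, be in zip(ys, by)]


def max_monomial_error(rule, n, a, b, c, d):
    """Largest error of the rule over all monomials x^r y^q with r + q <= n."""
    worst = 0.0
    for r in range(n + 1):
        for q in range(n + 1 - r):
            exact = ((b ** (r + 1) - a ** (r + 1)) / (r + 1)) * ((d ** (q + 1) - c ** (q + 1)) / (q + 1))
            approx = sum(w * x ** r * y ** q for x, y, w in rule)
            worst = max(worst, abs(exact - approx))
    return worst


rule = gauss_product_rule(5, 0.0, 1.0, 0.0, 2.0)
assert len(rule) == (5 + 1) ** 2 // 4          # only (n + 1)^2 / 4 = 9 nodes
assert max_monomial_error(rule, 5, 0.0, 1.0, 0.0, 2.0) < 1e-12
```

None of the nine nodes lies on the boundary of the rectangle, which is the drawback for composite formulae noted above.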

The extension to higher dimensions of the method of separation of 
variables is fairly clear. The restrictions on the domain are somewhat 
complicated, but, for instance, it is sufficient for the domain to be convex. 
In particular, for rectangular parallelopipeds, only a single one dimensional 
quadrature formula need be specified for each dimension. If the appro¬ 
priate Gaussian schemes are used in this case, we obtain degree of precision 
n (odd) by using only $[(n + 1)/2]^p$ nodes in p dimensions. 

7.4. Composite Formulae for Multiple Integrals 

Just as in the case of one dimensional integrals, it may be necessary 
to decompose the integral (1) into a sum of integrals over smaller non-overlapping 
domains. That is, if $D_i \cap D_j$ has no inner points for $i \ne j$ and 

$D = D_1 \cup D_2 \cup \cdots \cup D_M,$

then 

(20)    $J\{f\} = \int_D f(\mathbf{x})\,d\mathbf{x} = \sum_{i=1}^{M} \int_{D_i} f(\mathbf{x})\,d\mathbf{x}.$

If N nodes are used to calculate the integral over each of the primitive 
domains $D_i$, say 

$J_N\{f; D_i\} = \sum_{j=1}^{N} \alpha_{ij} f(\mathbf{x}_{ij}),$

then at most MN evaluations of $f(\mathbf{x})$ are used in (20). We say at most 
because a node $\mathbf{x}$ of $D_i$ may also be a node of $D_j$ but $f(\mathbf{x})$ need only be 
found once for such a node. 

If the region D is a p-dimensional rectangular parallelopiped, then a 
corner (or vertex) node of $D_i$ may also occur in as many as $2^p - 1$ 
adjoining cells $D_j$. Hence the amount of work necessary to evaluate the 
integrand may be minimized by selecting as many nodes as possible to be 
vertex nodes, then edge nodes, then face nodes, etc. For example, in two 
dimensions, the scheme of Figure 1b is more efficient than the scheme of 
Figure 1a. 
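
The saving from shared vertex nodes can be seen in a small experiment. The sketch below is an assumed example, not a scheme from the text: it applies the product trapezoidal rule (N = 4 vertex nodes per cell) on the unit square divided into M = k x k square cells and counts how many distinct evaluations of f are actually needed.

```python
def composite_trapezoid_square(f, k):
    """Composite product trapezoidal rule on the unit square split into k * k cells.

    Returns (approximation, number of distinct evaluations of f).
    """
    h = 1.0 / k
    cache = {}

    def value(i, j):
        # Grid indices identify a node, so a vertex shared by several
        # cells is evaluated only once.
        if (i, j) not in cache:
            cache[(i, j)] = f(i * h, j * h)
        return cache[(i, j)]

    total = 0.0
    for i in range(k):
        for j in range(k):
            # Four vertex nodes per cell, each with coefficient h * h / 4.
            total += 0.25 * h * h * (value(i, j) + value(i + 1, j)
                                     + value(i, j + 1) + value(i + 1, j + 1))
    return total, len(cache)


approx, evaluations = composite_trapezoid_square(lambda x, y: x * y, 4)
assert evaluations == 25          # (k + 1)^2 distinct nodes instead of M * N = 64
assert abs(approx - 0.25) < 1e-12
```

With k = 4 the formal count MN is 64, but only 25 function values are computed, since each interior vertex belongs to four cells.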

If the cubature formula used in $D_i$, $J_N\{f; D_i\}$, has degree of precision n, 
and has non-negative weights $\{\alpha_{ij}\}$, then just as in the derivation of (12), 

(21)    $\left| \int_{D_i} f(\mathbf{x})\,d\mathbf{x} - \sum_{j=1}^{N} \alpha_{ij} f(\mathbf{x}_{ij}) \right| \le \frac{(ph)^{n+1}}{(n + 1)!}\, 2M_{n+1} \int_{D_i} d\mathbf{x},$

when $D_i$ is contained in a cube of side 2h, and 

$\left| \frac{\partial^{n+1}}{\partial x_1^{j_1} \cdots \partial x_p^{j_p}} f(\mathbf{x}) \right| \le M_{n+1}$

for all $\mathbf{x}$ in D and all $\{j_k\}$ satisfying $\sum_{k=1}^{p} j_k = n + 1$. With these conventions, 
it is then a simple matter to derive the fundamental estimate of the error 
in the composite cubature formula, 

(22)    $\left| J\{f\} - \sum_{i=1}^{M} J_N\{f; D_i\} \right| \le \frac{(ph)^{n+1}}{(n + 1)!}\, 2M_{n+1} \int_D d\mathbf{x}.$


PROBLEMS, SECTION 7 

1. Devise the cubature schemes indicated by (4a) and (5) for equally spaced 
nodes when (1) m = n = 0; (2) m = 0, n = 1; (3) m = n = 1; (4) m = n = 2. 
Case (4) is the generalization of Simpson’s rule and (3) is the generalization 
of the trapezoidal rule to integrals over rectangles. 

2. Determine the general cubature schemes for n = 0, 1, 2 determined by 
integrating $P_n(x, y)$, given in (6.10) of Chapter 6, over an arbitrary domain D. 
Specialize these results for a rectangle $a \le x \le b$, $c \le y \le d$. Take uniform 
spacing in this rectangle in each case to simplify further. (Note: These schemes 
are not uniquely determined; see Figure 1.) 

3. Compare the schemes (3) and (4) of Problem 1 to those with n = 1 
and n = 2, respectively, in Problem 2 by approximating the integral 

J J — y 1 dx dy. 

Try at least two of the nodal schemes for each case of the methods of Problem 2. 

4. Determine the ratio M/N for the number of nodes required in four and 
five dimensions, to extend the formulae $J_N\{f\}$ and $K_M\{f\}$ of Subsection 7.1 
which have a degree of precision at least n. 






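5. Consider the case N = n = 2 in the equations (9). Show that the resulting 
system does not have a solution in general by considering the special case 

$\iint_D x\,dx\,dy = \iint_D y\,dx\,dy = \iint_D xy\,dx\,dy = 0.$

[Hint: Introduce the notation 

$A_1 x_1 = \xi, \quad A_1 y_1 = \eta, \quad A_1 = \zeta, \quad \iint_D dx\,dy = \zeta_0,$

and show that the system reduces to 

$\xi\eta \left( \frac{1}{\zeta} + \frac{1}{\zeta_0 - \zeta} \right) = 0, \quad \xi^2 \left( \frac{1}{\zeta} + \frac{1}{\zeta_0 - \zeta} \right) = \iint_D x^2\,dx\,dy, \quad \eta^2 \left( \frac{1}{\zeta} + \frac{1}{\zeta_0 - \zeta} \right) = \iint_D y^2\,dx\,dy.]$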

6. Give the proof of (22) in detail. 




8 


Numerical Solution of 
Ordinary Differential Equations 


0. INTRODUCTION 

In order to study the effectiveness of various methods for the numerical 
solution of differential equation problems, we illustrate the theory for 
the case of the general first order ordinary differential equation 

(1a)    $\frac{dy}{dx} = f(x, y),$

subject to the initial condition 

(1b)    $y(a) = y_0.$

It is required to find a solution, y = y(x), of the problem (1) in some 
interval, say $a \le x \le b$. Under suitable restrictions† on the function 
f(x, y), it is well known that a unique solution exists. 

The class of methods to be discussed uses a subdivision of the interval 
I = [a, b], by a finite set of distinct points 

(2)    $I_\Delta$: $x_0 = a$, $x_{i+1} = x_i + \Delta x_i$, i = 0, 1, ..., N. 

Finer subdivisions also play a role and are denoted by the same generic 
symbol $I_\Delta$. In the present context, the set of points defining a subdivision 
is frequently called a net, grid, lattice, or mesh. The quantities $\Delta x_i$ are called 
the net spacings or mesh widths. Corresponding to each point of the net 
we seek a quantity, say $u_i$, which is to approximate $y_i = y(x_i)$, the exact 
solution at the corresponding net point. The set of values $\{u_i\}$ is called a 
net function. Clearly, the values $\{y_i\}$ also form a net function on $I_\Delta$. We use 
the generic symbol $\{u_i\}$ to denote a net function on any subdivision. 

† For instance, existence and uniqueness of the solution are assured if f(x, y) is 
bounded, continuous in x, and Lipschitz continuous with respect to y in some 
sufficiently large rectangle $R_C$: $[a \le x \le b,\ |y - y_0| \le C]$, see equation (1.5). 

For most of the methods treated in this chapter, the quantities $\{u_i\}$ 
are to be determined from a set of (usually non-algebraic) equations which 
in some sense approximate the system (1); these approximating equations 
are called difference equations. The natural requirements for the approximating 
difference equations are that for any function f(x, y) (in some 
class of sufficiently often differentiable functions): 

(a) They have a unique solution. 

(b) Their solution, at least for “sufficiently small” net spacings, should 
be “close” to the exact solution of (1). 

(c) Their solution should be “effectively computable.” 

Property (a) is trivially satisfied by many of the difference equations to 
be studied, the so-called explicit schemes . Whether or not the implicit 
schemes satisfy condition (a) is determined by a study of the roots of a 
sequence of equations (or systems), of the form z = g{z) (see Section 2). 
In general, if $\Delta x_i$ is small enough, the implicit equations have a unique 
solution. 

Property (b) is related to the question of the convergence, as 
$\max_i \Delta x_i \to 0$, of $\{u_i\}$ to $\{y_i\}$. The study of such convergence properties of 
the difference solution shall occupy a considerable part of this chapter. 
In Sections 1 through 3 we examine separately the convergence of each 
of several special methods. In Sections 5 and 6 we give a general treatment 
of convergence which includes the previous cases. 

The vaguely formulated property (c) involves two important considerations: 
(i) the number of single precision computations required; (ii) the 
growth of roundoff errors in the computed difference “solution.” Of 
course, these two points are related since having to compensate for 
rounding errors by using more significant figures usually entails additional 
computations. A trivial first approximation of (i) is based on the operational 
count for infinite precision arithmetic. The growth of the roundoff error 
is related to the notion of stability of difference equations. The stability 
theory of difference equations treated in Section 5 is based on the study 
of difference equations with constant coefficients developed in Section 4. 
We establish the main general theorem of this chapter in Section 5 (i.e., 
stability is equivalent to convergence for consistent methods). 

There are a number of systematic ways in which one can “derive” or 
rather generate difference equations that approximate or are consistent 
with (1). That is, these difference equations seem to be discrete models 
for the continuous problem (1). But, no matter how reasonable the 
derivation, the efficacy of such difference equations can only be determined 
by checking conditions (a)-(c). In fact, in Subsection 1.4 we derive a 
discrete model which seems quite reasonable, but is absolutely useless 
since the growth of the roundoff error cannot be controlled (i.e., it is 
unstable). 

Later, in Section 5, a simple criterion is developed for recognizing when 
a finite difference scheme is stable and convergent. 

It should be recalled that some of the numerical methods for approximating 
solutions of ordinary differential equations, and systems of them, 
also have important theoretical applications. In fact, one of the basic 
existence and uniqueness proofs uses the Euler-Cauchy difference method 
of the next section. We resist the temptation to present such a proof here. 
Rather, we will assume that Problem (1) is “well-posed,” i.e., it has a 
unique solution with a certain number of continuous derivatives and 
furthermore, the solution depends differentiably on the initial data. As 
indicated in the footnote on page 364, we can guarantee the well-posedness 
of Problem (1) for a wide class of functions f(x, y). We will be interested 
in showing that certain difference methods have properties (a)-(c) for such 
a class of functions, f(x, y). 

At the present time there seems to be no general way of formulating an 
“ideal method” for solving (1). An “ideal method” is one which requires 
the least amount of work (number of single precision computations) to 
produce an approximate solution of (1) accurate to within a given $\epsilon > 0$. 

In the following sections a number of inequalities are derived with the 
use of two simple lemmas: 

LEMMA 1. For all real numbers z: 

(3)    $1 + z \le e^z,$

(where the equality holds only for z = 0). 

Proof. Since the function $e^z$ has continuous derivatives of all orders 
we have by Taylor’s theorem 

$e^z = 1 + z + \frac{z^2}{2} e^{\theta z}, \qquad 0 < \theta < 1.$

But the last term on the right-hand side is non-negative and vanishes 
only when z = 0 and so the lemma follows. ■ 

A simple corollary of this result is contained in 

LEMMA 1′. For all z such that $1 + z \ge 0$, 

(4)    $0 \le (1 + z)^n \le e^{nz}, \qquad n \ge 0.$

Proof. Obvious from Lemma 1. 

1. METHODS BASED ON APPROXIMATING THE DERIVATIVE: 
EULER-CAUCHY METHOD 

To illustrate the basic concepts we consider the simple difference 
approximation to (0.1a) which results from approximating the derivative 
by a forward difference quotient, 

(1a)    $\frac{u_{j+1} - u_j}{h} = f(x_j, u_j), \qquad j = 0, 1, \ldots, N - 1.$

Here, for simplicity only, we have chosen a uniform net 

(2)    $x_0 = a$; $x_j = x_0 + jh$, $j = 0, 1, \ldots, N$; $h = \frac{b - a}{N}$. 

The initial condition (0.1b) is replaced by 

(1b)    $u_0 = y_0 + e_0,$

where we have intentionally permitted the introduction of an initial error, 
$e_0$. Equations (1) are the difference equations of the Euler-Cauchy method. 
This method is also called the polygon method, where the polygon is 
constructed by joining successive points $(x_j, u_j)$ with straight line segments. 
Each segment has the slope given by the value of f at the left endpoint. 

The existence of a unique solution $u_j$ of the difference equations follows 
from writing (1a) as 

(3)    $u_{j+1} = u_j + h f(x_j, u_j), \qquad j = 0, 1, \ldots, N - 1.$

Then with $u_0$, given in (1b), the above yields recursively $u_1, u_2, \ldots, u_N$ 
provided only that $f(x_j, u_j)$ is defined. 
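
The recursion (3) is short enough to state as code. The sketch below is a minimal version on the uniform net (2); the function name `euler_cauchy` and the argument `e0` for the initial error in (1b) are naming choices made here.

```python
def euler_cauchy(f, a, b, y0, N, e0=0.0):
    """Solve u_{j+1} = u_j + h*f(x_j, u_j), j = 0..N-1, with u_0 = y0 + e0.

    Returns the net points x_0..x_N and the net function u_0..u_N.
    """
    h = (b - a) / N
    xs = [a + j * h for j in range(N + 1)]
    us = [y0 + e0]
    for j in range(N):
        us.append(us[j] + h * f(xs[j], us[j]))
    return xs, us


# y' = y, y(0) = 1 on [0, 1] with N = 4: each step multiplies u by 1 + h = 1.25.
xs, us = euler_cauchy(lambda x, y: y, 0.0, 1.0, 1.0, 4)
assert us[-1] == 1.25 ** 4
```

The exact solution at x = 1 is e, roughly 2.718, so with only four steps the polygon approximation 2.441 is still visibly below it.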

The present analysis of (3) is based on using infinite precision arithmetic. 
That is, the numbers that would be calculated in finite precision arithmetic 
satisfy $U_{j+1} = [U_j + h f(x_j, U_j)] + \rho_{j+1}$ where $\rho_{j+1}$ is the rounding error 
made in evaluating the term in brackets. Later on, in Subsection 2, we 
study the error, $U_j - y_j$, as $h \to 0$. 

We now turn to a consideration of the error $\{e_j\}$ defined by 

(4)    $e_j = u_j - y_j, \qquad j = 0, 1, \ldots, N.$

For this study we require that, in some region S to be specified later, 
f(x, y) be continuous in x and satisfy a uniform Lipschitz condition in the 
variable y: 

(5)    $|f(x, y) - f(x, y')| \le K|y - y'|$, for some constant K > 0; 
       all (x, y) and (x, y′) ∈ S. 

[If K = 0, then f(x, y) is independent of y and the problem (0.1) is a simple 
problem of quadrature treated in Chapter 7.] In addition, we will need a 
measure of the error by which the exact solution of (0.1) fails to satisfy 
the difference relation (1a). This error is called the local truncation error 
or discretization error and is defined by: 

$\tau_{j+1} = \frac{y_{j+1} - y_j}{h} - f(x_j, y_j), \qquad j = 0, 1, \ldots, N - 1.$

This relation is frequently written as 

(6)    $y_{j+1} = y_j + h f(x_j, y_j) + h\tau_{j+1}, \qquad j = 0, 1, \ldots, N - 1.$

An explicit representation of the $\tau_j$ will be derived shortly [under additional 
conditions on f(x, y)]. If the $\tau_j$ vanish as $h \to 0$, we say that the difference 
equations are consistent with the differential equation. But as will be seen 
in Subsection 2, for another consistent scheme, the corresponding difference 
solution $\{u_j\}$ can diverge as $h \to 0$, from the exact solution $\{y_j\}$ even 
though $e_0$ is small. However, in the present case, we have 

THEOREM 1. Let $\{u_j\}$ be the solution of (1) and y(x) the solution of (0.1) 
where f(x, y) satisfies (5) in the strip S: $[a \le x \le b,\ |y| < \infty]$. Then, with 
the definitions (6) of $\{\tau_j\}$: 

(7)    $|u_j - y(x_j)| \le e^{K(x_j - a)} \left[ |e_0| + \frac{\tau}{K} \right], \qquad j = 0, 1, \ldots, N,$

where 

$\tau = \max_j |\tau_j|.$

Proof. The subtraction of (6) from (3) yields 

$e_{j+1} = e_j + h[f(x_j, u_j) - f(x_j, y_j)] - h\tau_{j+1}, \qquad j = 0, 1, \ldots, N - 1.$

By means of the Lipschitz condition we deduce that 

$|e_{j+1}| \le (1 + hK)|e_j| + h\tau.$

This inequality yields recursively 

$|e_{j+1}| \le (1 + hK)^2 |e_{j-1}| + [1 + (1 + hK)]h\tau$
$\quad \le (1 + hK)^3 |e_{j-2}| + [1 + (1 + hK) + (1 + hK)^2]h\tau$
$\quad \le \cdots$
$\quad \le (1 + hK)^{j+1} |e_0| + \frac{(1 + hK)^{j+1} - 1}{hK}\, h\tau,$

where we have summed the geometric progression. Since K > 0, we may 
apply Lemma 0.1′ in the form 

$(1 + hK)^{j+1} \le e^{(j+1)hK} = e^{K(x_{j+1} - x_0)},$

to obtain 

(8)    $|e_{j+1}| \le e^{K(x_{j+1} - x_0)} \left[ |e_0| + \frac{\tau}{K} \right],$

and the theorem follows. ■ 
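
The bound of Theorem 1 can be checked numerically on a problem whose exact solution is known. The sketch below assumes the test problem y′ = y, y(a) = 1, for which the Lipschitz constant is K = 1 and the exact solution is $e^{x - a}$; the local truncation errors are computed directly from their definition (6). The function name is hypothetical.

```python
import math


def euler_error_and_bound(N, e0=0.0, a=0.0, b=1.0):
    """For y' = y, y(a) = 1 (so K = 1), return pairs (|e_j|, bound (7)) for j = 0..N."""
    K = 1.0
    f = lambda x, y: y
    h = (b - a) / N
    xs = [a + j * h for j in range(N + 1)]
    ys = [math.exp(x - a) for x in xs]
    # Local truncation errors tau_{j+1} from definition (6), using the exact solution.
    taus = [(ys[j + 1] - ys[j]) / h - f(xs[j], ys[j]) for j in range(N)]
    tau = max(abs(t) for t in taus)
    u = ys[0] + e0
    rows = []
    for j in range(N + 1):
        bound = math.exp(K * (xs[j] - a)) * (abs(e0) + tau / K)
        rows.append((abs(u - ys[j]), bound))
        u = u + h * f(xs[j], u)   # Euler-Cauchy step (3)
    return rows


rows = euler_error_and_bound(N=100, e0=1e-3)
assert all(err <= bound for err, bound in rows)
```

In this example the bound exceeds the actual error by a modest factor at x = b; for other problems it can be a much cruder overestimate.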

The simple bound of Theorem 1 shows that the error at any net point, 
$x_j$, will be small if both the initial error, $e_0$, and the maximum local truncation 
error, $\tau$, are small. Now, the value of $|e_0|$ is determined by the 
accuracy with which the number $y_0$, the initial condition, is approximated. 
On the other hand, we may guarantee that $\tau$ can be made arbitrarily small 
by picking h sufficiently small, if $d^2y/dx^2$ is continuous in [a, b]. That is, 
from Taylor’s theorem, 

$y(x_j + h) = y(x_j) + h \frac{dy(x_j)}{dx} + \frac{h^2}{2} \frac{d^2 y(x_j + \theta_j h)}{dx^2}, \qquad 0 < \theta_j < 1; \quad j = 0, 1, \ldots, N - 1.$

However, since y(x) is the solution of (0.1), $dy(x_j)/dx = f(x_j, y_j)$ and a 
comparison of the above with (6) yields 

(9)    $\tau_{j+1} = \frac{h}{2} \frac{d^2 y(x_j + \theta_j h)}{dx^2}, \qquad 0 < \theta_j < 1, \quad j = 0, 1, \ldots, N - 1.$


Using this representation of $\tau_j$ in Theorem 1 and the formula obtained 
from (0.1a) by differentiation 

$\frac{d^2 y}{dx^2} = f_x(x, y) + f_y(x, y) \frac{dy}{dx},$

we obtain a result which may be summarized as the 

COROLLARY. If, in addition to the hypothesis of Theorem 1, $f_x(x, y)$ and 
$f_y(x, y)$ are continuous in S, then 

$|e_j| \le e^{K(x_j - a)} \left( |e_0| + h \frac{M_2}{2K} \right) \le e^{K(b - a)} \left( |e_0| + h \frac{M_2}{2K} \right),$

where† 

$M_2 = \max_{a \le x \le b} \left| \frac{d^2 y(x)}{dx^2} \right|.$

† If in S, $|f_x| \le P$, $|f_y| \le Q$, and $|f| \le R$, then $\left| \frac{d^2 y}{dx^2} \right| \le P + QR$. Hence for such a 
class of functions f(x, y), we find the a priori bound $M_2 \le P + QR$. 

If $e_0 = 0$ or $|e_0| \le \alpha h$ for some constant $\alpha$, then as a consequence of the 
corollary, $\lim_{h \to 0} e_j = 0$, or more precisely, the maximum norm of the error 
$\{e_j\}$ is at most O(h) and converges uniformly to zero since the rightmost 
bound is independent of j. Note that since $f_y$ is assumed continuous in the 
corollary, the condition (5) need not be postulated but can, in fact, be 
deduced if $K = \sup_S |f_y|$ is finite. 
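
The O(h) behaviour can be observed by halving the net spacing and comparing errors. The following sketch assumes the test problem y′ = −2y, y(0) = 1 on [0, 1] with $e_0 = 0$; the problem and the function name are chosen here only for illustration.

```python
import math


def euler_final_error(N, lam=-2.0):
    """Error |u_N - y(1)| of the Euler-Cauchy method for y' = lam*y, y(0) = 1 on [0, 1]."""
    h = 1.0 / N
    u = 1.0
    for _ in range(N):
        u = u + h * lam * u
    return abs(u - math.exp(lam))


# Halving h should roughly halve the error when the error is O(h).
e_coarse = euler_final_error(50)
e_fine = euler_final_error(100)
assert 1.9 < e_coarse / e_fine < 2.1
```

A ratio close to 2 is consistent with first order convergence, although a single experiment of this kind is of course not a proof.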

In general, the bounds on the error are usually tremendous overestimates. 
It is possible, however, to obtain more precise expressions for the error, 
essentially under the conditions of the corollary. These expressions are in 
turn not practical since they cannot be evaluated explicitly. But since they 
do have analytical significance we present 

THEOREM 2. If $\{e_i\}$ and $\{\tau_j\}$ are defined by (4) and (6) respectively, and 
$f_y(x, y)$ is continuous in S, then there exist numbers $\phi_i$ in $0 < \phi_i < 1$ such 
that 

(10)    $e_i = A_{i,0}\, e_0 - h \sum_{j=1}^{i} A_{i,j}\, \tau_j, \qquad i = 1, 2, \ldots, N;$

where $A_{0,0} = 1$ and 

(11)    $A_{i,j} = \begin{cases} 0, & j \ge i + 1, \\ 1, & j = i, \\ \alpha_{i-1} A_{i-1,j}, & j < i; \end{cases} \qquad \alpha_i = 1 + h f_y(x_i, y_i + \phi_i e_i).$

Proof. The proof is similar to that of Theorem 1 but now the mean 
value theorem is used in place of the Lipschitz condition. Thus, from (6) 
and (3), 

$e_{i+1} = e_i + h[f(x_i, u_i) - f(x_i, y_i)] - h\tau_{i+1}$
$\quad = \alpha_i e_i - h\tau_{i+1}.$

To show that the algebraic manipulations, in the recursive application 
of the above, yield quantities of the form (10) and (11), we proceed by 
induction. Then with i = 0 in the above 

$e_1 = \alpha_0 e_0 - h\tau_1$
$\quad = A_{1,0}\, e_0 - h A_{1,1}\, \tau_1.$

Now we assume (10) to be valid and use (11) to obtain 

$e_{i+1} = \alpha_i e_i - h\tau_{i+1}$
$\quad = \alpha_i A_{i,0}\, e_0 - h \sum_{j=1}^{i} \alpha_i A_{i,j}\, \tau_j - h\tau_{i+1}$
$\quad = A_{i+1,0}\, e_0 - h \sum_{j=1}^{i} A_{i+1,j}\, \tau_j - h A_{i+1,i+1}\, \tau_{i+1}$
$\quad = A_{i+1,0}\, e_0 - h \sum_{j=1}^{i+1} A_{i+1,j}\, \tau_j.$

The induction is thus complete and the theorem follows. ■ 
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By restricting h to be sufficiently small the exact error expression (10) 
can be reduced to a simple form which has practical significance. We state 
this result as 

COROLLARY 1. Under the hypothesis of Theorem 2 let

$d \equiv \inf_S f_y(x, y), \qquad D \equiv \sup_S f_y(x, y)$

be finite and restrict $h$ so that

(12)   $1 + hd > 0.$

Then for each $i = 1, 2, \dots, N$, there exist three numbers $p_i$, $q_i$, and $t_i$ in the intervals

(13)   $d \le p_i \le D, \qquad d \le q_i \le D, \qquad \min_{1 \le j \le i} \tau_j \le t_i \le \max_{1 \le j \le i} \tau_j,$

such that

(14)   $e_i = (1 + hp_i)^i\, e_0 - \left[ \frac{(1 + hq_i)^i - 1}{q_i} \right] t_i, \qquad i = 1, 2, \dots, N.$

Proof. We note that from (11) and (12) it follows that

$0 < 1 + hd \le \alpha_j \le 1 + hD, \qquad j = 0, 1, \dots, N - 1.$

Then

$(1 + hd)^i \le A_{i,0} = \alpha_0 \alpha_1 \cdots \alpha_{i-1} \le (1 + hD)^i,$

and hence there is a number $p_i$ in the interval $[d, D]$ such that

$A_{i,0} = (1 + hp_i)^i.$

Now define the quantities

(15)   $S_i \equiv \sum_{j=1}^{i} A_{i,j}, \qquad t_i \equiv \sum_{j=1}^{i} \left( \frac{A_{i,j}}{S_i} \right) \tau_j.$

The $A_{i,j}$ are non-negative as a result of condition (12). Hence $t_i$, which is an average with non-negative weights of the $\tau_j$, must satisfy condition (13) (see Lemma 1.1 of Chapter 7 which can be used to prove this assertion). We also note, using (11), that

$0 < (1 + hd)^{i-j} \le A_{i,j} = \alpha_j \alpha_{j+1} \cdots \alpha_{i-1} \le (1 + hD)^{i-j}, \quad j < i; \qquad A_{i,i} = 1.$

Then from the definition of $S_i$

$1 + (1 + hd) + \cdots + (1 + hd)^{i-1} \le S_i \le 1 + (1 + hD) + \cdots + (1 + hD)^{i-1},$

or by summing the progressions

$\frac{(1 + hd)^i - 1}{hd} \le S_i \le \frac{(1 + hD)^i - 1}{hD}, \qquad i = 1, 2, \dots, N.$

But the functions $[(1 + z)^i - 1]/z$ are continuous functions of $z$ and hence there exist numbers $q_i$, in the interval $[d, D]$, such that

$S_i = \frac{(1 + hq_i)^i - 1}{hq_i}, \qquad i = 1, 2, \dots, N.$

The corollary now follows by using the expressions for $A_{i,0}$, $S_i$, and $t_i$ in (10). ■

The form of the error given in (14) can be used to derive practical information in many cases. Clearly, if $d$ and $D$ are known, or can be estimated, we can obtain upper and lower bounds on the factors which multiply the two error terms ($e_0$ and $t_i$). A more striking application occurs, however, when $f(x, y)$ is such that $D < 0$ (i.e., $f_y < 0$). Now clearly, $p_i < 0$, $q_i < 0$, and by the condition (12) imposed on $h$:

$0 < (1 + hp_i) < 1, \qquad 0 < (1 + hq_i) < 1.$

Then by Lemma 0.1'

$(1 + hp_i)^i \le e^{ihp_i} = e^{(x_i - a)p_i},$

or since $p_i < 0$, this may be written as

$(1 + hp_i)^i \le e^{-|p_i (x_i - a)|} < 1.$

Similarly, $0 < (1 + hq_i)^i < 1$, and by taking absolute values in (14), we find

COROLLARY 2. If $f_y < 0$, then the hypotheses of Theorem 2 and its Corollary 1 imply

(16)   $|e_i| \le e^{-|p_i (x_i - a)|}\, |e_0| + \frac{\tau}{|D|}, \qquad i = 1, 2, \dots, N.$ ■

This result shows that the initial error cannot grow if $f_y < 0$ and further, that the local truncation errors in this case contribute at most an amount $\tau/|D|$.
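The bound (16) can be checked numerically. The sketch below is only an illustration under stated assumptions: the test problem $y' = -y$, $y(0) = 1$ on $[0, 1]$ (so that $d = D = -1$), the step count, and the helper name `euler_errors` are all choices made here, not part of the text. It computes the Euler-Cauchy errors $e_i$ and the local truncation errors $\tau_i$, takes $\tau = \max_i |\tau_i|$ and $e_0 = 0$, and compares $\max_i |e_i|$ with $\tau/|D|$.

```python
import math

def euler_errors(f, exact, a, b, y0, steps):
    """Euler-Cauchy on [a, b]; returns (errors e_i, local truncation errors tau_i)."""
    h = (b - a) / steps
    u = y0
    errors = [0.0]          # e_0 = 0: exact initial data (an assumption of this sketch)
    taus = []
    for i in range(steps):
        x_i = a + i * h
        # tau_{i+1}: residual of the exact solution in the difference scheme
        taus.append((exact(x_i + h) - exact(x_i)) / h - f(x_i, exact(x_i)))
        u = u + h * f(x_i, u)
        errors.append(u - exact(x_i + h))
    return errors, taus

f = lambda x, y: -y                 # f_y = -1, so d = D = -1
exact = lambda x: math.exp(-x)
errors, taus = euler_errors(f, exact, 0.0, 1.0, 1.0, 10)   # h = 0.1, so 1 + h*d > 0
tau = max(abs(t) for t in taus)
D = -1.0
max_error = max(abs(e) for e in errors)
print("max |e_i| =", max_error, " bound tau/|D| =", tau / abs(D))
```

In this example the observed error stays well below $\tau/|D|$, in line with the earlier remark that such bounds tend to overestimate.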


1.1. Improving the Accuracy of the Numerical Solution 

We now improve upon the corollary to Theorem 1 by characterizing the $\mathcal{O}(h)$ term in the error $\{e_j\}$.

THEOREM 3. Let the solution $y = y(x)$ of (0.1) have three continuous and bounded derivatives; let $f_{yy}(x, y)$ be continuous and bounded, and let the initial error $e_0$ in the difference solution $\{u_j\}$ of (1) be

$e_0 = \xi_0 h,$

where $\xi_0$ is independent of $h$. Then

$e_j = h\,\xi(x_j) + \mathcal{O}(h^2), \qquad j = 0, 1, \dots,$

where $\xi(x)$ is the solution† of the linear problem

$\xi'(x) = f_y(x, y(x))\, \xi(x) - \tfrac{1}{2} y''(x),$

$\xi(a) = \xi_0.$

Proof. As in the proof of Theorem 2,

$e_{i+1} = \alpha_i e_i - h\tau_{i+1}.$

But now in (9) and (11) we use the extra differentiability properties to obtain

$\alpha_i = 1 + h f_y(x_i, y_i) + h f_{yy}(x_i, y_i + \phi_i' e_i)\, \phi_i e_i,$

$\tau_{i+1} = \frac{h}{2} y''(x_i) + \frac{h}{2} y'''(x_i + \theta_i' h)\, \theta_i h, \qquad 0 < \phi_i, \phi_i', \theta_i, \theta_i' < 1.$

Then from this,

$e_{i+1} = [1 + h f_y(x_i, y_i)]\, e_i - \frac{h^2}{2} y''(x_i) + h[\mathcal{O}(e_i^2) + \mathcal{O}(h^2)].$

By using the differential equation which defines $\xi(x)$, Taylor's expansion yields

$\xi(x_{i+1}) = \xi(x_i) + h\,\xi'(x_i) + \mathcal{O}(h^2)$

$\qquad = [1 + h f_y(x_i, y_i)]\, \xi(x_i) - \frac{h}{2} y''(x_i) + \mathcal{O}(h^2).$

We now form the quantities

$\delta_j \equiv e_j - h\,\xi(x_j),$

and find

$\delta_{i+1} = [1 + h f_y(x_i, y_i)]\, \delta_i + h[\mathcal{O}(e_i^2) + \mathcal{O}(h^2)], \qquad i = 0, 1, \dots.$

Now we observe, as remarked after the corollary to Theorem 1, that $|e_i| = \mathcal{O}(h)$. Hence we may delete the term $\mathcal{O}(e_i^2)$. But, from the specific initial conditions chosen for $\xi(x)$ we have $\delta_0 = 0$, and hence a recursive application of the formulae for $\delta_i$ yields, as in the derivation of (7),

$|\delta_j| \le e^{K(b - a)}\, \mathcal{O}(h^2) = \mathcal{O}(h^2)$

and the theorem follows. ■

† Under the hypothesis, $\xi(x)$ exists and has a continuous second derivative. In fact, $\xi(x)$ can be explicitly represented by quadratures.

To apply Theorem 3, we introduce the notation $u_j(h)$ to indicate the dependence of the numerical solution on the net spacing. Then the theorem states that with, say $x_j = z$,

$u_j(h) = y(z) + h\,\xi(z) + \mathcal{O}(h^2).$

Similarly, with the net spacing $h/2$ and $x_{2j} = z$, we have

$u_{2j}\!\left(\tfrac{h}{2}\right) = y(z) + \tfrac{h}{2}\,\xi(z) + \mathcal{O}(h^2).$

Then

$2 u_{2j}\!\left(\tfrac{h}{2}\right) - u_j(h) = y(z) + \mathcal{O}(h^2)$

and an extra order of magnitude in accuracy is obtained if we use as the difference approximation, at any point $x_j = z$ of the net with spacing $h$, the quantity

$\bar{u}_j \equiv 2 u_{2j}\!\left(\tfrac{h}{2}\right) - u_j(h).$

This requires computations with two nets of spacings $h$ and $h/2$ respectively. It should be observed that the formula for $\bar{u}_j$ is similar to the formula which arises in Aitken's $\delta^2$-process in the iterative solution of arbitrary equations (see Subsection 2.4 of Chapter 3). In the present context this procedure is called Richardson's deferred approach to the limit, or extrapolation to zero mesh width. This extrapolation may be applied, in an appropriately modified form, to many of the numerical methods to be considered here.
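The extrapolation can be sketched in a few lines. The code below is an illustration only: it assumes the Euler-Cauchy recursion $u_{i+1} = u_i + h f(x_i, u_i)$ with exact initial data, and the test problem $y' = y$, $y(0) = 1$, the step counts, and the function names `euler` and `richardson` are choices made here rather than anything prescribed by the text.

```python
import math

def euler(f, a, y0, h, steps):
    """Euler-Cauchy: u_{i+1} = u_i + h f(x_i, u_i); returns the value after `steps` steps."""
    u = y0
    for i in range(steps):
        u += h * f(a + i * h, u)
    return u

def richardson(f, a, y0, z, steps):
    """Combine nets of spacing h and h/2 at the common point z: 2 u_{2j}(h/2) - u_j(h)."""
    h = (z - a) / steps
    coarse = euler(f, a, y0, h, steps)
    fine = euler(f, a, y0, h / 2, 2 * steps)
    return 2 * fine - coarse

f = lambda x, y: y                      # exact solution y = exp(x)
z, steps = 1.0, 20
fine_error = abs(euler(f, 0.0, 1.0, z / (2 * steps), 2 * steps) - math.e)
extrapolated_error = abs(richardson(f, 0.0, 1.0, z, steps) - math.e)
print("error on the h/2 net :", fine_error)
print("extrapolated error   :", extrapolated_error)
```

The extrapolated value uses the same function evaluations as the two Euler runs, so the extra accuracy costs only the final linear combination.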

1.2. Roundoff Errors 

In actually performing the calculations required to evaluate (1), roundoff errors will, in general, be introduced. Thus the numbers actually obtained will not be the set $\{u_i\}$ but, say, some quantities $\{U_i\}$. These numbers satisfy equations of the form

(17a)   $U_{i+1} = U_i + h f(x_i, U_i) + \rho_{i+1}, \qquad i = 0, 1, \dots, N - 1;$

where $\rho_{i+1}$ represents the error introduced by inexact evaluation of the quantity $U_i + h f(x_i, U_i)$. The $\rho_i$ are called the local roundoff errors. If we let $\rho_0$ be the initial roundoff error committed in evaluating $y_0$, then the initial condition becomes

(17b)   $U_0 = y_0 + \rho_0.$

Let the errors between the $U_i$, the actual numbers obtained in the computation, and the $u_i$, the exact solution of the difference equations, be denoted by

(18)   $\epsilon_i \equiv U_i - u_i, \qquad i = 0, 1, \dots, N.$

Then from (1) and (17) we obtain

(19)   $\epsilon_0 = \rho_0 - e_0;$

       $\epsilon_{i+1} = \epsilon_i + h[f(x_i, U_i) - f(x_i, u_i)] + \rho_{i+1}, \qquad i = 0, 1, \dots, N - 1.$

It is clear that these equations for $\epsilon_i$ are formally similar to those which determine the quantities $e_i$. In fact, the previous theorems and corollaries can be restated in an obvious way to give bounds and representations of the errors $\epsilon_i$. We shall return to the study of the growth of the $\epsilon_i$ in Section 5. But, we now consider the more important total errors

(20)   $E_i \equiv U_i - y(x_i) = e_i + \epsilon_i, \qquad i = 0, 1, \dots, N,$

between the actual numerical solution, $U_i$, and the exact solution of the differential equation, $y(x_i)$. In an obvious manner, we find that

(21)   $E_0 = \rho_0;$

       $E_{i+1} = E_i + h[f(x_i, U_i) - f(x_i, y(x_i))] - h\left( \tau_{i+1} - \frac{\rho_{i+1}}{h} \right), \qquad i = 0, 1, \dots, N - 1.$

Again we may prove the analogs of the previous results.

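THEOREM 4. Under the conditions of the corollary to Theorem 1, we find that the error (20) satisfies

(22a)   $|E_j| \le e^{K(x_j - a)} \left[ |\rho_0| + \frac{1}{K} \left( \frac{h M_2}{2} + \frac{\rho}{h} \right) \right], \qquad \text{for } j = 0, 1, \dots, N,$

where the roundoff errors $\rho_i$ are defined in (17) and

(22b)   $\rho \equiv \max_{1 \le i \le N} |\rho_i|, \qquad M_2 \equiv \max_{a \le x \le b} \left| \frac{d^2 y(x)}{dx^2} \right|.$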




Figure 1. Comparison of truncation and roundoff error bounds as functions of h.


The dependence of this error bound on the net spacing, $h$, is illustrated in Figure 1.

Clearly, the choice of $h$ for which the bound in (22) is a minimum is obtained when

(23)   $\frac{h M_2}{2} = \frac{\rho}{h} \qquad \text{or} \qquad h = \sqrt{2\rho / M_2}.$

For this optimal value of $h$,

$\frac{h M_2}{2} + \frac{\rho}{h} = \sqrt{2 \rho M_2}.$

In many calculations performed on electronic computers $\rho \ll M_2$, and so the "optimal" value for $h$ will be unnecessarily small and need not be employed. Furthermore, the bound (22) indicates that for fixed $h$ no greater accuracy is obtained by reducing the roundoff error so that

$\frac{\rho}{h} \ll \frac{h M_2}{2};$

in fact, any extra labor required for such computational precision is essentially wasted. If the relation (23) is approximately satisfied there might be some fortuitous cancellation of local roundoff and truncation errors.

On the other hand, we remark that (22) establishes the convergence as $h \to 0$ of the Euler-Cauchy method (17) if the rounding error satisfies

$|\rho_i| \le \rho = \mathcal{O}(h^2) \quad \text{for } i = 1, 2, \dots, N; \qquad |E_0| = |\rho_0| = \xi_0 h.$

In fact, under these circumstances

$|E_j| = \mathcal{O}(h); \qquad j = 0, 1, \dots, N.$

We leave to Problem 2 the proof of the validity of the Richardson extrapolation to zero mesh width provided $\rho = \mathcal{O}(h^3)$.
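The balance expressed by (23) is easy to evaluate. The sketch below only illustrates the $h$-dependent part $hM_2/2 + \rho/h$ of the bound (22); the numerical values chosen for $M_2$ and $\rho$ are hypothetical and stand for no particular machine or problem, and the function names are assumptions of this example.

```python
import math

def bound_term(h, M2, rho):
    """The h-dependent factor h*M2/2 + rho/h appearing in the bound (22)."""
    return h * M2 / 2 + rho / h

def optimal_h(M2, rho):
    """Spacing from (23): truncation and roundoff contributions are equal."""
    return math.sqrt(2 * rho / M2)

M2, rho = 1.0, 1e-16          # hypothetical magnitudes, for illustration only
h_opt = optimal_h(M2, rho)
print("optimal h          :", h_opt)
print("bound term at h_opt:", bound_term(h_opt, M2, rho))
print("sqrt(2*rho*M2)     :", math.sqrt(2 * rho * M2))
```

With these assumed values the optimal spacing is far smaller than any spacing that would be used in practice, which illustrates the remark that the "optimal" $h$ need not be employed.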

1.3. Centered Difference Method 

To obtain greater accuracy with a fixed mesh size we seek difference approximations with smaller local truncation errors. One such modification of (1a) is suggested by attempting to approximate the derivative at $x_i$ by a more accurate expression than the forward difference quotient. We shall briefly examine here the use of the centered formula

(24a)   $\frac{u_{i+1} - u_{i-1}}{2h} = f(x_i, u_i), \qquad i = 1, 2, \dots, N - 1.$

However, in order to use these difference equations to compute $\{u_j\}$, two starting values are required, say

(24b)   $u_0 = y_0 + e_0, \qquad u_1 = y_1 + e_1.$

The first value $u_0$ is again the approximation of the exact initial data, while $u_1$ should be determined such that $|e_1|$ is "small." This could be done by employing the Euler-Cauchy method in the interval $0 \le x \le h$ with some smaller spacing, say $h' = h/N'$ with $N' > 1$, or by developing the Taylor's series

$u_1 = y_0 + h y_0' + \frac{h^2}{2} y_0'',$

with

$y_0' = f(x_0, y_0), \qquad y_0'' = f_x(x_0, y_0) + y_0' f_y(x_0, y_0).$

However, this problem or similar ones will occur again and shall be discussed in more detail later.

The truncation error in (24a) is now defined by

(25)   $y_{i+1} = y_{i-1} + 2h f(x_i, y_i) + 2h\tau_{i+1}, \qquad i = 1, 2, \dots, N - 1.$

Let us make sure that $y(x)$ has three continuous derivatives by assuming that $f(x, y)$ has continuous second derivatives. Then by Taylor's theorem

$y_{i+1} = y(x_i + h) = y_i + h y_i' + \frac{h^2}{2} y_i'' + \frac{h^3}{3!} y'''(\xi_{i+1}), \qquad x_i < \xi_{i+1} < x_{i+1};$

$y_{i-1} = y(x_i - h) = y_i - h y_i' + \frac{h^2}{2} y_i'' - \frac{h^3}{3!} y'''(\xi_{i-1}), \qquad x_{i-1} < \xi_{i-1} < x_i.$

These equations imply

$y_{i+1} = y_{i-1} + 2h y_i' + \frac{h^3}{3!} [y'''(\xi_{i+1}) + y'''(\xi_{i-1})]$

and since $y_i' = f(x_i, y_i)$, a comparison with (25) yields

(26)   $\tau_{i+1} = \frac{h^2}{6} \cdot \frac{y'''(\xi_{i+1}) + y'''(\xi_{i-1})}{2} = \frac{h^2}{6} y'''(\xi_i), \qquad i = 1, 2, \dots, N - 1.$

Here we have used the continuity of $y'''(x)$ to replace the average of third derivatives by an intermediate value at $\xi_i$. Hence the local truncation error of the centered scheme is smaller than the truncation error (9) of the Euler-Cauchy method as $h \to 0$. The present $\tau_i$ vanish to second order in $h$ and we thus call the centered scheme (24a) a second order method.

To show that these effects are indeed relevant for the convergence of the finite difference solution, we again consider the errors $e_i = u_i - y(x_i)$ and find from (24a) and (25)

$e_{i+1} = e_{i-1} + 2h[f(x_i, u_i) - f(x_i, y(x_i))] - 2h\tau_{i+1}, \qquad i = 1, 2, \dots, N - 1.$

To demonstrate convergence, let us introduce the bounds

(27)   $K \ge 2 \left| \frac{\partial f}{\partial y} \right|, \qquad M_3 \ge |y'''(x)|, \qquad \tau \equiv \frac{h^2}{3} M_3 \ge 2|\tau_i|.$

Hence by taking absolute values

(28)   $|e_{i+1}| \le hK|e_i| + |e_{i-1}| + h\tau; \qquad i = 1, 2, \dots, N - 1.$

To obtain bounds on the $|e_i|$ we introduce a comparison or majorizing set of quantities $\{a_i\}$ defined by

(29)   $a_0 \equiv \max(|e_0|, |e_1|),$

       $a_{i+1} \equiv (1 + hK) a_i + h\tau, \qquad i = 0, 1, \dots, N - 1.$

From the definition it is clear that $a_0 \ge |e_0|$. We will show by induction that $a_j \ge |e_j|$, $j = 0, 1, \dots, N$. Assume that $a_j \ge |e_j|$, $j = 0, 1, \dots, i$. Equation (28) yields

$|e_{i+1}| \le hK a_i + a_{i-1} + h\tau$

$\qquad \le (1 + hK) a_i + h\tau$

$\qquad = a_{i+1}.$

Here we have employed (29) and the obvious relation $a_i \ge a_{i-1}$. Hence the induction proof is complete and $a_j \ge |e_j|$ for $j = 0, 1, \dots, N$. However, by the usual recursive application of (29) we now obtain

(30)   $|e_N| \le a_N = (1 + hK)^N a_0 + \frac{(1 + hK)^N - 1}{K}\, \tau$

       $\qquad \le (1 + hK)^N \left( a_0 + \frac{\tau}{K} \right)$

       $\qquad \le e^{K(b - a)} \left[ \max(|e_0|, |e_1|) + \frac{h^2 M_3}{3K} \right].$

By comparing (30) with the result in the corollary to Theorem 1, we now find that the error is of order $h^2$ if the initial errors, $e_0$ and $e_1$, are proportional to $h^2$. So in the present centered scheme, the higher order local truncation error (26) is reflected in faster convergence of $\{u_i\}$ to $\{y(x_i)\}$ as $h \to 0$. We might naturally expect this to be the case in general, and so seek difference equations with local truncation errors of arbitrarily high order in $h$. However, as is demonstrated in the next subsection, this expectation is not always realized.
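The second order convergence of the centered scheme can be observed on a simple example. The sketch below is an illustration under assumptions made here: the test problem $y' = y$, $y(0) = 1$ on $[0, 1]$, the Taylor-series starting value $u_1 = y_0 + h y_0' + \tfrac{h^2}{2} y_0''$ (which for this problem is $1 + h + h^2/2$), and the function name `centered` are not taken from the text.

```python
import math

def centered(f, a, b, u0, u1, steps):
    """Centered scheme (24a): u_{i+1} = u_{i-1} + 2 h f(x_i, u_i), i = 1, ..., steps - 1."""
    h = (b - a) / steps
    prev, cur = u0, u1
    for i in range(1, steps):
        prev, cur = cur, prev + 2 * h * f(a + i * h, cur)
    return cur

f = lambda x, y: y                                  # exact solution y = exp(x)

def error_at_b(steps):
    h = 1.0 / steps
    u1 = 1.0 + h + h * h / 2                        # Taylor start; e_1 = O(h^3)
    return abs(centered(f, 0.0, 1.0, 1.0, u1, steps) - math.e)

err_50, err_100 = error_at_b(50), error_at_b(100)
print("error with h = 1/50 :", err_50)
print("error with h = 1/100:", err_100)
print("ratio               :", err_50 / err_100)
```

Halving $h$ should reduce the error by a factor close to 4, which is the behavior that (30) predicts when the starting errors are of order $h^2$ or smaller.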

It should be mentioned that roundoff effects can also be included in the 
study of the present centered scheme. 

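THEOREM 5. Let the roundoffs $\rho_i$ satisfy

$U_0 = y_0 + \rho_0, \qquad U_1 = y_1 + \rho_1;$

$U_{i+1} = U_{i-1} + 2h f(x_i, U_i) + \rho_{i+1}, \qquad \text{for } i = 1, 2, \dots, N - 1,$

where

$\max_{2 \le i \le N} |\rho_i| = \rho.$

Then $E_i \equiv U_i - y_i$ can be bounded by

$|E_i| \le e^{K(x_i - a)} \left\{ \max(|\rho_0|, |\rho_1|) + \frac{1}{K} \left[ \frac{h^2}{3} M_3 + \frac{\rho}{h} \right] \right\}, \qquad i = 0, 1, \dots, N,$

provided (27) holds. ■

Now the error is at least of order $h^2$, if the maximum roundoff error satisfies $\rho = \mathcal{O}(h^3)$ as $h \to 0$, while $\rho_0 = \mathcal{O}(h^2)$, $\rho_1 = \mathcal{O}(h^2)$.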


1.4. A Divergent Method with Higher Order Truncation Error 

To demonstrate the care which must be taken in generating difference schemes, we consider a case with third order local truncation error but which is completely unsuitable for computation. The basis for this scheme is an attempt to approximate better the derivative at $x_i$, and thus to obtain a local truncation error which is higher order in $h$. For this purpose we consider a difference equation of the form

(31)   $a_1 u_{i+1} + a_2 u_i + a_3 u_{i-1} + a_4 u_{i-2} = f(x_i, u_i), \qquad i = 2, 3, \dots, N - 1;$

and seek coefficients $a_1, \dots, a_4$ such that the local truncation error is of as high an order, in $h$, as possible. This is essentially the method of undetermined coefficients applied to the problem of approximating derivatives (see Subsection 5.1 of Chapter 6). That is, recalling $y'(x_i) = f(x_i, y(x_i))$, we define $\tau_i$ by

(32)   $a_1 y(x_{i+1}) + a_2 y(x_i) + a_3 y(x_{i-1}) + a_4 y(x_{i-2}) - y'(x_i) = \tau_{i+1}.$

Then if $y(x)$ has four continuous derivatives, Taylor's theorem yields

(33)   $y_{i+1} = y(x_i + h) = y_i + h y'(x_i) + h^2 y''(x_i)/2 + h^3 y'''(x_i)/3! + h^4 y^{iv}(\xi_{i+1})/4!$

       $y_{i-1} = y(x_i - h) = y_i - h y'(x_i) + h^2 y''(x_i)/2 - h^3 y'''(x_i)/3! + h^4 y^{iv}(\xi_{i-1})/4!$

       $y_{i-2} = y(x_i - 2h) = y_i - 2h y'(x_i) + 4h^2 y''(x_i)/2 - 8h^3 y'''(x_i)/3! + 16 h^4 y^{iv}(\xi_{i-2})/4!$

Forming the sum indicated in (32) and requiring as many terms as possible to vanish, we find

$a_1 + a_2 + a_3 + a_4 = 0$

$(a_1 + 0 - a_3 - 2a_4) h = 1$

$(a_1 + 0 + a_3 + 4a_4) h^2 = 0$

$(a_1 + 0 - a_3 - 8a_4) h^3 = 0.$

This system of linear equations has the unique solution

(34)   $a_1 = \frac{1}{3h}, \qquad a_2 = \frac{1}{2h}, \qquad a_3 = -\frac{1}{h}, \qquad a_4 = \frac{1}{6h}.$

Now (32) can be written as

(35)   $y(x_{i+1}) = -\tfrac{3}{2} y(x_i) + 3 y(x_{i-1}) - \tfrac{1}{2} y(x_{i-2}) + 3h f(x_i, y(x_i)) + 3h\tau_{i+1}$

where the local truncation error is, from (33) and (34) in (32),

(36)   $\tau_{i+1} = \frac{h^3}{4!} \left[ \tfrac{1}{3} y^{iv}(\xi_{i+1}) - y^{iv}(\xi_{i-1}) + \tfrac{8}{3} y^{iv}(\xi_{i-2}) \right].$

The difference equations (31) become

(37a)   $u_{i+1} = -\tfrac{3}{2} u_i + 3 u_{i-1} - \tfrac{1}{2} u_{i-2} + 3h f(x_i, u_i), \qquad i = 2, 3, \dots, N - 1;$

and to these must be adjoined starting values, say,

(37b)   $u_0 = y_0 + e_0, \qquad u_1 = y_1 + e_1, \qquad u_2 = y_2 + e_2.$

As before, the "extra" values $u_1$ and $u_2$ would have to be obtained by some other procedure. However, we proceed to show that this scheme, which has third order truncation errors, does not converge in general.

Let us first consider a case in which $f_y$ is continuous and

$\left| \frac{\partial f}{\partial y} \right| \le K, \qquad \tau \equiv h^3 M_4 \ge \max_i |\tau_i|.$

Then with $e_i \equiv u_i - y_i$ we obtain, from (35) and (37), in the usual way

$|e_{i+1}| \le (\tfrac{3}{2} + 3hK)|e_i| + 3|e_{i-1}| + \tfrac{1}{2}|e_{i-2}| + 3h\tau.$

If we introduce a majorizing set $\{a_i\}$, analogous to (29), we find that

$|e_N| \le (5 + 3hK)^N \left[ \max(|e_0|, |e_1|, |e_2|) + \frac{3h\tau}{4 + 3hK} \right]$

$\qquad \le 5^N e^{3K(b - a)/5} \left[ \max(|e_0|, |e_1|, |e_2|) + \frac{3h\tau}{4 + 3hK} \right]$

$\qquad = 5^{(b - a)/h}\, e^{3K(b - a)/5} \left[ \max(|e_0|, |e_1|, |e_2|) + \frac{3h\tau}{4 + 3hK} \right].$

However, as $h \to 0$ this bound becomes infinite. While this does not prove divergence, we strongly suspect it.

To actually show that the third order scheme (37) cannot converge in general, we apply it to the special case where $f(x, y) = -y$; i.e., to the equation

$\frac{dy}{dx} = -y,$

whose solution $y = y_0 e^{-(x - a)}$ satisfies $y(a) = y_0$.

Now (37a) can be written as

(38)   $u_{i+1} + (\tfrac{3}{2} + 3h) u_i - 3 u_{i-1} + \tfrac{1}{2} u_{i-2} = 0, \qquad i = 2, 3, \dots.$

This is a linear difference equation with constant coefficients (see Section 4) and it can be solved exactly. That is, we seek a solution of the form

$u_j = \alpha^j, \qquad j = 0, 1, \dots.$

But then from (38),

$[\alpha^3 + (\tfrac{3}{2} + 3h)\alpha^2 - 3\alpha + \tfrac{1}{2}]\, \alpha^{i-2} = 0, \qquad i = 2, 3, \dots.$

Thus, in addition to the trivial solution, $\alpha = 0$, we have three solutions of the form $u_j = \alpha_\nu^j$, where $\alpha_\nu$ for $\nu = 1, 2, 3$ are the roots of

(39)   $\alpha^3 + (\tfrac{3}{2} + 3h)\alpha^2 - 3\alpha + \tfrac{1}{2} = 0.$

It is easily verified that, for $h$ sufficiently small, these three roots are distinct.

It is easy to check that a linear combination

(40)   $u_j = A_1 \alpha_1^j + A_2 \alpha_2^j + A_3 \alpha_3^j,$

is also a solution. The coefficients $A_\nu$ are determined from the assumed known data for $j = 0, 1, 2$, by satisfying

$A_1 + A_2 + A_3 = u_0$

$A_1 \alpha_1 + A_2 \alpha_2 + A_3 \alpha_3 = u_1$

$A_1 \alpha_1^2 + A_2 \alpha_2^2 + A_3 \alpha_3^2 = u_2.$

Since the coefficient determinant is a Vandermonde determinant and the $\alpha_\nu$ are distinct, the $A_\nu$ are uniquely determined. Then the $u_j$ for $j \ge 3$ are also uniquely determined by (40).

Let us write

$p(\alpha, h) \equiv \alpha^3 + (\tfrac{3}{2} + 3h)\alpha^2 - 3\alpha + \tfrac{1}{2},$

and denote the roots of $p(\alpha, h) = 0$ by $\alpha_\nu(h)$. Since

$p(\alpha, 0) \equiv (\alpha - 1)(\alpha^2 + \tfrac{5}{2}\alpha - \tfrac{1}{2})$

we have, with the ordering $\alpha_1 < \alpha_2 < \alpha_3$,

$\alpha_1(0) = -\frac{5 + \sqrt{33}}{4}, \qquad \alpha_2(0) = \frac{\sqrt{33} - 5}{4}, \qquad \alpha_3(0) = 1.$

For $h$ sufficiently small, $|\alpha_\nu(h) - \alpha_\nu(0)|$ can be made arbitrarily small and $\alpha_1(h) < -2$. So the solution (40), for large $j$, behaves like

$u_j \approx A_1 [\alpha_1(h)]^j$

or in particular for $x_N = b$,

$u_N \approx A_1 [\alpha_1(h)]^{(b - a)/h}.$

Thus as $h \to 0$ the difference solution becomes exponentially unbounded at any point $x_N = b > a$. Furthermore, notice that $\alpha_1(h)$ is negative, and hence $u_j$ oscillates. This behavior is typical of "unstable" schemes (see Section 5). Of course, we have assumed here that the initial data is such that $A_1 \ne 0$. In fact,

$A_1 = \frac{\det \begin{pmatrix} u_0 & 1 & 1 \\ u_1 & \alpha_2 & \alpha_3 \\ u_2 & \alpha_2^2 & \alpha_3^2 \end{pmatrix}}{\det \begin{pmatrix} 1 & 1 & 1 \\ \alpha_1 & \alpha_2 & \alpha_3 \\ \alpha_1^2 & \alpha_2^2 & \alpha_3^2 \end{pmatrix}}.$

Hence if

$u_1 = u_0 + ch, \qquad u_2 = u_0 + dh,$

then in general, it follows that

$A_1 = hp + \mathcal{O}(h^2), \qquad p \ne 0.$

This is based on the fact that $\alpha_i(h)$ can be developed in the form

$\alpha_i(h) = \alpha_i(0) + h\alpha_i'(0) + \mathcal{O}(h^2).$

(For example, in the exceptional case

$u_1 = \alpha_3 u_0 \quad \text{and} \quad u_2 = \alpha_3^2 u_0,$

the quantity $A_1 = 0$.)

For the actual calculations of the quantities $U_j$, the local roundoff error at any net point $x_j$ will set off such an exponentially growing term. Hence this method is divergent!
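The divergence is easy to observe in practice. The following sketch applies (37a) to $y' = -y$, $y(0) = 1$ on $[0, 1]$; taking the three starting values from the exact solution, the particular step counts, and the function names are assumptions of this illustration. Even with exact starting data, truncation and roundoff errors excite the root $\alpha_1(h) < -2$, so the error at $x = 1$ grows as the net is refined.

```python
import math

def p(alpha, h):
    """Characteristic polynomial (39) of the scheme applied to y' = -y."""
    return alpha**3 + (1.5 + 3.0 * h) * alpha**2 - 3.0 * alpha + 0.5

def third_order_scheme(f, exact, a, b, steps):
    """Scheme (37a); the starting values u_0, u_1, u_2 are taken from the exact solution."""
    h = (b - a) / steps
    u = [exact(a + j * h) for j in range(3)]
    for i in range(2, steps):
        x_i = a + i * h
        u.append(-1.5 * u[i] + 3.0 * u[i - 1] - 0.5 * u[i - 2] + 3.0 * h * f(x_i, u[i]))
    return u

f = lambda x, y: -y
exact = lambda x: math.exp(-x)

def error_at_b(steps):
    return abs(third_order_scheme(f, exact, 0.0, 1.0, steps)[-1] - exact(1.0))

for steps in (10, 20, 40):
    print(steps, "steps: |u_N - y(b)| =", error_at_b(steps))

alpha1_0 = -(5.0 + math.sqrt(33.0)) / 4.0      # the root alpha_1(0) of p(alpha, 0)
print("alpha_1(0) =", alpha1_0, " p(alpha_1(0), 0) =", p(alpha1_0, 0.0))
```

Refining the net makes the computed value at $x = 1$ worse rather than better, in contrast to the convergent schemes considered earlier.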


PROBLEMS, SECTION 1 

1. If $f(x, y)$ is independent of $y$, i.e., the Lipschitz constant $K = 0$, show that the error estimates of Theorem 1 and its corollary are respectively

(a) $|e_j| \le |e_0| + |x_j - x_0|\, \tau, \qquad j \ge 0;$

(b) $|e_j| \le |e_0| + h |x_j - x_0|\, M_2/2, \qquad j \ge 0.$

2. Carry out the proof of the validity of the Richardson extrapolation to zero mesh width for the Euler-Cauchy method defined in (17) with rounding errors $\rho_0 = \xi_0 h$ and for $i = 1, 2, \dots$, $|\rho_i| \le \rho = \mathcal{O}(h^3)$.

3. Find the coefficient $\alpha_i'(0)$ in the expansion

$\alpha_i(h) = \alpha_i(0) + h\alpha_i'(0) + \mathcal{O}(h^2)$

by formally substituting this expression into (39) and setting the coefficient of $h$ equal to zero.

2. MULTISTEP METHODS BASED ON QUADRATURE FORMULAE


The study of the divergent scheme introduced in (1.37) shows that more accurate approximations of the derivative do not necessarily lead to more accurate numerical methods. But a way to determine convergent schemes with an arbitrarily high order of accuracy is suggested by converting the original differential equation into an equivalent integral equation. Thus by integrating (0.1a) over the interval $[a, x]$ and using (0.1b) we obtain

(1)   $y(x) = y_0 + \int_a^x f(\xi, y(\xi))\, d\xi.$

Clearly, any solution of (0.1) satisfies this integral equation and, by differentiation, we find that any solution† of (1) also satisfies (0.1a) and (0.1b). With the subscript notation, $y_i \equiv y(x_i)$, the solution of (0.1) or (1) also satisfies

(2)   $y_{i+1} = y_{i-p} + \int_{x_{i-p}}^{x_{i+1}} f(x, y(x))\, dx.$

This is obtained by integrating (0.1a) over $[x_{i-p}, x_{i+1}]$ for any $i = 0, 1, \dots$ and any $p = 0, 1, \dots, i$. For a given choice of $p$ a variety of approximations are suggested by applying quadrature formulae to evaluate the integral in (2). The number of schemes suggested in this manner is great, but in practice only relatively few of them are ever used.

We shall limit our study to the case of uniformly spaced net points and interpolatory quadrature formulae. In order to classify these methods in a fairly general manner we distinguish two types of quadrature formulae:

Type A. Closed on the right, i.e., with $n + 1$ nodes

$x_{i+1}, x_i, x_{i-1}, \dots, x_{i+1-n};$

or else

Type B. Open on the right, i.e., with $n + 1$ nodes

$x_i, x_{i-1}, \dots, x_{i-n}.$

The difference equations suggested by these two classes of methods can be written as

(3a)   $u_{i+1} = u_{i-p} + h \sum_{j=0}^{n} \alpha_j f(x_{i+1-j}, u_{i+1-j});$

(3b)   $u_{i+1} = u_{i-p} + h \sum_{j=0}^{n} \beta_j f(x_{i-j}, u_{i-j}).$

† It is easy to see that any continuous solution of (1) is differentiable. This follows since the right-hand side of (1) is differentiable, if $f(x, y)$ and $y(x)$ are continuous.

We have denoted the coefficients of the quadrature formulae by $h\alpha_j$ and $h\beta_j$ and note that they are independent of $i$. If the net spacing were not uniform, the coefficients would depend on $i$, in general, and the schemes would require the storage of more than $n + 1$ coefficients. This is one of the main reasons for the choice of uniform spacing in methods based on quadrature formulae.

When the integers $p$ and $n$ are specified, the coefficients are determined as in Subsection 1.2 of Chapter 7. We see from (1.15) of Chapter 7 that the quantities $\alpha_j$ and $\beta_j$ are independent of $h$, the net spacing. It also follows from Theorem 1.3 of Chapter 7 that the interpolatory quadrature formulae in (3) have the maximum degree of precision possible with the specified nodes; this is reflected in their having the smallest "truncation error," defined later in equation (8a and b).

In order to compute with a method of type A (closed on the right) we must have available the quantities $u_i, \dots, u_{i+1-n}$ and $u_{i-p}$. Thus the points $x_{i+1}$ for which (3a) may be used satisfy

(4a)   $i \ge \max(n - 1, p) \equiv r,$

provided that $u_0, u_1, \dots, u_r$ are given. Similarly, method B requires $u_i, \dots, u_{i-n}$ and $u_{i-p}$, so that the points $x_{i+1}$ for which (3b) can be used satisfy

(4b)   $i \ge \max(n, p) \equiv r,$

provided that $u_0, u_1, \dots, u_r$ are given. Special procedures are required to obtain these starting values in either case (see Section 3).

The fundamental difference between the open and closed methods is the ease with which the difference equations can be solved. There is no difficulty in solving the equations based on the open formulae. In fact, since formula (3b) is an explicit expression for $u_{i+1}$ these are called explicit methods. But, the closed formulae define implicit methods since the equation for the determination of $u_{i+1}$ is implicit.† That is, (3a) is of the form

(5a)   $u_{i+1} = g_i(u_{i+1}),$

where

$g_i(z) \equiv c_i + h\alpha_0 f(x_{i+1}, z),$

with

$c_i \equiv u_{i-p} + h \sum_{j=1}^{n} \alpha_j f(x_{i+1-j}, u_{i+1-j}),$

(5b)   $\frac{dg_i(z)}{dz} = h\alpha_0 \frac{\partial f(x_{i+1}, z)}{\partial z}.$

† In the special case that $f(x, y)$ is linear in $y$, i.e., $f(x, y) = a(x)y + b(x)$, the implicit equation (3a) is easily solved explicitly for $u_{i+1}$ if $1 - h\alpha_0 a(x_{i+1}) \ne 0$.

Clearly the method of functional iteration is natural for solving (5a). By Theorem 1.2 of Chapter 3 we know that, if a "sufficiently close" initial estimate of the root is given, the iterations

$u_{i+1}^{(\nu+1)} = g_i(u_{i+1}^{(\nu)}), \qquad \nu = 0, 1, \dots,$

will converge provided $h$ is sufficiently small, e.g.,

(6)   $h < \frac{1}{|\alpha_0| K}, \qquad K \equiv \max \left| \frac{\partial f(x, y)}{\partial y} \right|.$

On the other hand, we may apply Theorem 1.1 of Chapter 3 to show existence of a unique root in the interval $[c_i - \rho, c_i + \rho]$. That is, by selecting $u_{i+1}^{(0)} = c_i$ we have

$|u_{i+1}^{(1)} - u_{i+1}^{(0)}| \le h|\alpha_0| M$

where $M \equiv \max |f(x, y)|$. Hence for any

$\rho \ge \frac{h|\alpha_0| M}{1 - h|\alpha_0| K}$

we have an interval in which a unique root of (5) exists. We shall see that it is not necessary to find the root of (5) in order to preserve the high accuracy of the method. In fact, the predictor-corrector technique, which we next study, uses only one iteration of (5) without a loss in accuracy.
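The functional iteration for (5a) can be sketched for the simplest closed formula, the trapezoidal rule ($n = 1$, $p = 0$, $\alpha_0 = \alpha_1 = \tfrac{1}{2}$). Everything specific in the code below is an assumption made for illustration: the function name `trapezoid_step`, the stopping tolerance, the iteration limit, and the linear test problem $f(x, y) = -4y$, for which $K = 4$ and the implicit equation can also be solved in closed form for comparison.

```python
def trapezoid_step(f, x, u, h, tol=1e-13, max_iter=200):
    """One step of the closed (implicit) trapezoidal formula, solved by functional iteration.

    Here g(z) = c + h*alpha_0*f(x + h, z) with alpha_0 = 1/2 and c = u + (h/2) f(x, u).
    """
    c = u + 0.5 * h * f(x, u)
    z = c                                   # initial guess u^(0) = c, as in the text
    for _ in range(max_iter):
        z_new = c + 0.5 * h * f(x + h, z)   # u^(nu+1) = g(u^(nu))
        if abs(z_new - z) <= tol:
            return z_new
        z = z_new
    raise RuntimeError("functional iteration did not converge")

f = lambda x, y: -4.0 * y        # K = 4, so condition (6) reads h < 1/(0.5 * 4) = 0.5
h = 0.1
iterated = trapezoid_step(f, 0.0, 1.0, h)
closed_form = (1.0 - 2.0 * h) / (1.0 + 2.0 * h)   # explicit solution for this linear f
print("functional iteration:", iterated)
print("closed form         :", closed_form)
```

For this choice the contraction factor is $h|\alpha_0|K = 0.2$, so each iteration reduces the error by roughly a factor of five.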

The predictor-corrector method is defined by

(7a)   $u_{i+1}^* = u_{i-q} + h \sum_{j=0}^{m} \beta_j f(x_{i-j}, u_{i-j});$

(7b)   $u_{i+1} = u_{i-p} + h \sum_{k=1}^{n} \alpha_k f(x_{i+1-k}, u_{i+1-k}) + h\alpha_0 f(x_{i+1}, u_{i+1}^*);$

(7c)   $i \ge s \equiv \max[p, q, m, n - 1].$

Here, an $m + 1$ point quadrature formula open on the right has been used [in the predictor (7a)] to approximate an integral over $[x_{i-q}, x_{i+1}]$. The closed formula (7b) (called the corrector) is similar to that of (3a) but $u_{i+1}^*$ has been used in place of $u_{i+1}$ in the right-hand side. Thus as previously indicated, the corrector is the first iteration step in solving the implicit equation (5) with the initial guess furnished by the predictor. Hence only two evaluations of the function $f(x, y)$ are required for each step of the predictor-corrector method; i.e., $f(x_i, u_i)$ and $f(x_{i+1}, u_{i+1}^*)$.

The procedure (7) can only be employed after the values $u_0, u_1, \dots, u_s$ have been determined. Here $s$, defined by (7c), is determined from the open and closed formulae (7a and b). To compute the $u_i$, $i \le s$, we refer to the procedures of Section 3.
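As a concrete instance of (7), the simplest choice $m = 0$, $q = 0$, $\beta_0 = 1$ for the predictor and $n = 1$, $p = 0$, $\alpha_0 = \alpha_1 = \tfrac{1}{2}$ for the corrector needs no extra starting values ($s = 0$). The sketch below is illustrative only: the test problem $y' = -y$, $y(0) = 1$, the step counts, and the function name `modified_euler` are assumptions of this example.

```python
import math

def modified_euler(f, a, b, y0, steps):
    """Predictor-corrector (7) with an Euler predictor and a trapezoidal corrector."""
    h = (b - a) / steps
    u = y0
    for i in range(steps):
        x_i = a + i * h
        f_i = f(x_i, u)
        u_star = u + h * f_i                              # predictor (7a)
        u = u + 0.5 * h * (f_i + f(x_i + h, u_star))      # corrector (7b), one iteration only
    return u

f = lambda x, y: -y
exact_b = math.exp(-1.0)
err_50 = abs(modified_euler(f, 0.0, 1.0, 1.0, 50) - exact_b)
err_100 = abs(modified_euler(f, 0.0, 1.0, 1.0, 100) - exact_b)
print("error with h = 1/50 :", err_50)
print("error with h = 1/100:", err_100)
print("ratio               :", err_50 / err_100)
```

Each step costs two evaluations of $f$, and halving $h$ reduces the error by a factor close to 4, as expected of a second order method.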

It is clear, from the analysis given in the remainder of this section, that 
the predictor-corrector method (7) has the same order of accuracy as the 
implicit method of type A defined by the corrector (3a), provided that 
the explicit predictor is sufficiently accurate. In other words, we avoid the 
necessity of repeatedly iterating the corrector, as in (5), by using a good 
initial approximation. We shall first develop estimates of the error in 
solving (1) by the predictor-corrector method (7). Next, we shall show 
how to modify the error estimates to cover the case of finite precision 
arithmetic, i.e., with rounding errors. Finally, we indicate briefly how the 
error estimates may be derived for the methods (3a) or (3b). 

The predictor-corrector method has the advantage of permitting the detection of an isolated numerical error through the comparison of $u_{i+1}^*$ with $u_{i+1}$ (or $U_{i+1}^*$ with $U_{i+1}$ defined later).

The truncation error of (7) is obtained as follows. We define $\sigma_{i+1}^*$ and $\sigma_{i+1}$ in terms of the exact solution $y = y(x)$ by

(8a)   $y_{i+1} = y_{i-q} + h \sum_{j=0}^{m} \beta_j f(x_{i-j}, y_{i-j}) + h\sigma_{i+1}^*;$

(8b)   $y_{i+1} = y_{i-p} + h \sum_{k=0}^{n} \alpha_k f(x_{i+1-k}, y_{i+1-k}) + h\sigma_{i+1}.$

Then with the definition

(8c)   $y_{i+1}^* \equiv y_{i+1} - h\sigma_{i+1}^*,$

the local truncation error, $\tau_{i+1}$, of the predictor-corrector method (7) is defined by

(9)   $y_{i+1} = y_{i-p} + h \sum_{k=1}^{n} \alpha_k f(x_{i+1-k}, y_{i+1-k}) + h\alpha_0 f(x_{i+1}, y_{i+1}^*) + h\tau_{i+1}, \qquad i \ge s.$

To obtain a more explicit expression for this error we subtract (9) from (8b) and use (8c) to get

(10)   $\tau_{i+1} = \sigma_{i+1} + \alpha_0 [f(x_{i+1}, y_{i+1}) - f(x_{i+1}, y_{i+1} - h\sigma_{i+1}^*)]$

       $\qquad = \sigma_{i+1} + h\sigma_{i+1}^*\, \alpha_0 \bar{f}_y.$

Here we have assumed $f_y$ to be continuous in $y$ and used the mean value theorem; $\bar{f}_y$ is a value of $f_y$ at some intermediate point in the obvious interval. The quantities $h\sigma_{i+1}^*$ and $h\sigma_{i+1}$ are the errors in the $(m + 1)$- and $(n + 1)$-point quadrature formulae, (8a and b). Explicit expressions for these quadrature errors have been obtained for various cases in Section 1 of Chapter 7. It should be noted from (10) that the order as $h \to 0$ of the truncation error in the predictor-corrector method is the same as the order for the corresponding closed formula used alone, provided that the order of $h\sigma^*$ is not less than the order of $\sigma$. Table 1 has a brief listing of commonly used predictor-corrector schemes.


Table 1   Some Common Predictor-Corrector Methods

Associated Name               m, q; n, p   Weights   j = 0     1        2        3        4       σ*                    σ

Modified Euler                0, 0; 1, 0   β_j =     1                                            (1/2) h y^(2)         -(1/12) h^2 y^(3)
                                           α_j =     1/2       1/2

Milne's Method (3 points)     2, 3; 2, 1   β_j =     8/3       -4/3     8/3                       (14/45) h^4 y^(5)     -(1/90) h^4 y^(5)
                                           α_j =     1/3       4/3      1/3

Improved Adams, or Moulton's  3, 0; 3, 0   β_j =     55/24     -59/24   37/24    -9/24            (251/720) h^4 y^(5)   -(19/720) h^4 y^(5)
Method (4 points)                          α_j =     9/24      19/24    -5/24    1/24

Milne's Method (5 points)     4, 5; 4, 3   β_j =     33/10     -21/5    39/5     -21/5    33/10   (41/140) h^6 y^(7)    -(8/945) h^6 y^(7)
                                           α_j =     14/45     64/45    24/45    64/45    14/45


2.1. Error Estimates in Predictor-Corrector Methods 

To examine convergence of the scheme (7) we introduce the errors

(11)   $e_i \equiv u_i - y_i, \qquad e_i^* \equiv u_i^* - y_i^*.$

Then subtraction of (9) from (7) yields, if $f_y$ is continuous,

$e_{i+1} = e_{i-p} + h \sum_{k=1}^{n} \alpha_k g_{i+1-k}\, e_{i+1-k} + h\alpha_0 g_{i+1}\, e_{i+1}^* - h\tau_{i+1}.$

Here we have used the mean value theorem to introduce

$g_j \equiv f_y(x_j, \tilde{y}_j), \qquad \tilde{y}_j \in (y_j, u_j) \text{ for } j \ne i + 1, \text{ while } \tilde{y}_{i+1} \in (y_{i+1}^*, u_{i+1}^*).$

However, from (7a) and (8a and c) we obtain

$e_{i+1}^* = e_{i-q} + h \sum_{j=0}^{m} \beta_j g_{i-j}\, e_{i-j},$

and using this in the above implies finally

(12)   $e_{i+1} = e_{i-p} + h \sum_{k=1}^{n} \alpha_k g_{i+1-k}\, e_{i+1-k} + h\alpha_0 g_{i+1}\, e_{i-q} + h^2 \alpha_0 g_{i+1} \sum_{j=0}^{m} \beta_j g_{i-j}\, e_{i-j} - h\tau_{i+1}, \qquad i \ge s.$

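To estimate these errors we introduce

(13)   $K \equiv \max \left| \frac{\partial f}{\partial y} \right|; \qquad \tau \equiv \max_j |\tau_j|; \qquad A \equiv \sum_{k=0}^{n} |\alpha_k|; \qquad B \equiv \sum_{j=0}^{m} |\beta_j|.$

Then by taking the absolute value of both sides of (12), we have

(14)   $|e_{i+1}| \le |e_{i-p}| + hK \left( |\alpha_0|\, |e_{i-q}| + \sum_{k=1}^{n} |\alpha_k|\, |e_{i+1-k}| \right) + h^2 K^2 |\alpha_0| \sum_{j=0}^{m} |\beta_j|\, |e_{i-j}| + h|\tau_{i+1}|; \qquad i \ge s.$

We again introduce a comparison or majorizing set, $\{a_i\}$, defined by

(15a)   $a_0 \equiv \max(|e_0|, |e_1|, \dots, |e_s|),$

(15b)   $a_{i+1} \equiv (1 + hKA + h^2 K^2 |\alpha_0| B)\, a_i + h\tau;$

and claim that

$|e_j| \le a_j; \qquad j = 0, 1, \dots, N.$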


The proof of this inequality is easily given by induction. From (15), $\{a_i\}$ is a non-decreasing sequence. Therefore, (15a) establishes the inequality for $j \le s$. Now assume the inequality holds for all $j \le i$ where $i \ge s$. Then (14) implies

$|e_{i+1}| \le a_{i-p} + hK \left( |\alpha_0|\, a_{i-q} + \sum_{k=1}^{n} |\alpha_k|\, a_{i+1-k} \right) + h^2 K^2 |\alpha_0| \sum_{j=0}^{m} |\beta_j|\, a_{i-j} + h\tau$

$\qquad \le \left[ 1 + hK \left( |\alpha_0| + \sum_{k=1}^{n} |\alpha_k| \right) + h^2 K^2 |\alpha_0| \sum_{j=0}^{m} |\beta_j| \right] a_i + h\tau$

$\qquad = (1 + hKA + h^2 K^2 |\alpha_0| B)\, a_i + h\tau$

$\qquad = a_{i+1}.$

Note that from the recursive expression for $e_{i+1}^*$, we have the single estimate $|e_{i+1}^*| \le (1 + hBK)\, |a_i|$. The application of (15b) recursively, in the by now familiar manner, yields the final result which can be summarized as

THEOREM 1. Let the predictor-corrector method (7) be applied to solve (0.1)
or (1) in $a \le x \le b$ with "initial" values, $u_0, u_1, \ldots, u_s$, satisfying

$$|u_i - y(x_i)| \le a_0, \qquad i = 0, 1, \ldots, s. \tag{16a}$$

Let $f_y(x, y)$ be continuous and bounded in $S: \{(x, y) \mid a \le x \le b;\ |y| < \infty\}$.
Then with the definitions (9) and (13) the errors in the numerical solution
satisfy, for $a \le x_j \le b$,

$$|u_j - y(x_j)| \le \Bigl[a_0 + \frac{\tau}{K(A + hK|\alpha_0|B)}\Bigr] \exp\,[(x_j - a) K (A + hK|\alpha_0|B)], \tag{16b}$$

$$|u^*_j - y^*(x_j)| \le \Bigl[a_0 + \frac{\tau}{K(A + hK|\alpha_0|B)}\Bigr] \exp\,[(x_j - a) K (A + hK|\alpha_0|B) + hBK]. \qquad \blacksquare \tag{16c}$$

From this theorem it follows that $u_j$ converges to $y(x_j)$ as $h \to 0$ if
$a_0 \to 0$ and $\tau \to 0$. The order in $h$ of the estimate (16) is the minimum of
the orders in $h$ of $a_0$ and $\tau$. We say that the methods for selecting the initial
data and method (7) are balanced if $a_0$ and $\tau$ vanish to the same order in $h$.
If they do not, then some "extra accuracy" has been wasted.
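The convergence statement can be observed numerically. The following is a minimal sketch, not the text's own program: it assumes one particular pair of the form (7), namely the midpoint rule as predictor ($q = 1$, $m = 0$, $\beta_0 = 2$) and the trapezoidal rule as corrector ($p = 0$, $n = 1$, $\alpha_0 = \alpha_1 = 1/2$), applied to the test problem $y' = -y$, $y(0) = 1$. The function names are hypothetical, and the starting value $u_1$ is taken from the exact solution so that $a_0$ does not limit the accuracy.

```python
import math

def predictor_corrector(f, a, b, y0, y1, N):
    """Midpoint-rule predictor followed by one trapezoidal correction."""
    h = (b - a) / N
    x = [a + j * h for j in range(N + 1)]
    u = [y0, y1]  # starting values u_0, u_1 (here s = 1)
    for i in range(1, N):
        # predictor, form (7a): u*_{i+1} = u_{i-1} + 2 h f(x_i, u_i)
        u_star = u[i - 1] + 2.0 * h * f(x[i], u[i])
        # corrector, form (7b): the predicted value is used only inside f
        u.append(u[i] + 0.5 * h * (f(x[i], u[i]) + f(x[i + 1], u_star)))
    return x, u

def error_at_b(N):
    """Error at x = 1 for y' = -y, y(0) = 1 (assumed test problem)."""
    h = 1.0 / N
    _, u = predictor_corrector(lambda x, y: -y, 0.0, 1.0, 1.0, math.exp(-h), N)
    return abs(u[-1] - math.exp(-1.0))

# Halving h should divide the error by about 4 for a second order pair.
ratio = error_at_b(50) / error_at_b(100)
```

For this pair both truncation contributions are of second order in $h$, so the observed ratio near 4 is what the estimate (16) leads one to expect.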

If the exact solution, $y(x)$, has sufficiently many continuous derivatives,†
then the local truncation error, $\tau$, can be simply expressed by using the
methods of Section 1, Chapter 7. For instance, from the crude estimates
of the form (1.8) of Chapter 7 and (10) we find

$$\tau = \mathcal{O}(h^{m+2}) + \mathcal{O}(h^{n+1}).$$

† If all partial derivatives of order $p$ of $f(x, y)$ are continuous, then $y(x)$ has a
continuous derivative of order $p + 1$. This results by differentiating (1) often enough.

From these estimates, we see that the predictor and corrector formulae
are balanced if $m + 1 = n$. The method is said to be of order $t = \min\,(m + 2, n + 1)$.
[In the special case $p = n - 1$, $q = m + 1$, with $m$ and $n$ both even,

$$\tau = \mathcal{O}(h^{m+3}) + \mathcal{O}(h^{n+2}),$$

which results from the estimate of error in the Newton-Cotes formulae
(1.11) of Chapter 7. Parity makes $m = n$ optimal.]

Of course, roundoff errors are committed when the formulae in (7)
are evaluated. If we call $U_i$ and $U^*_i$ the numbers actually obtained in these
evaluations, then we can write

$$U^*_{i+1} = U_{i-q} + h \sum_{j=0}^{m} \beta_j f(x_{i-j}, U_{i-j}) + \rho^*_{i+1}, \tag{17a}$$

$$U_{i+1} = U_{i-p} + h \sum_{k=1}^{n} \alpha_k f(x_{i+1-k}, U_{i+1-k}) + h \alpha_0 f(x_{i+1}, U^*_{i+1}) + \rho_{i+1}, \qquad i \ge s. \tag{17b}$$

Here $\rho^*_i$ and $\rho_i$ are the roundoff errors introduced into each of the indicated
computations. Now we define the errors

$$E_j \equiv U_j - y(x_j), \qquad E^*_j \equiv U^*_j - y^*_j;$$

and obtain from (8), (9), and (17), as in the derivation of (12),

$$\begin{aligned}
E_{i+1} = {} & E_{i-p} + h \sum_{k=1}^{n} \alpha_k g_{i+1-k}\, E_{i+1-k} + h \alpha_0 g_{i+1}\, E_{i-q} \\
& + h^2 \alpha_0 g_{i+1} \sum_{j=0}^{m} \beta_j g_{i-j}\, E_{i-j} - h\tau_{i+1} + \rho_{i+1} + h \alpha_0 g_{i+1}\, \rho^*_{i+1}, \qquad i \ge s,
\end{aligned} \tag{18}$$

with

$$g_j \equiv f_y(x_j, \bar{y}_j) \quad \text{and} \quad \bar{y}_j \in (y_j, U_j) \ \text{ for } j \ne i + 1,$$

while

$$\bar{y}_{i+1} \in (y^*_{i+1}, U^*_{i+1}).$$

By applying the previous method of analysis to this system we find the 
total error bound in 






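THEOREM 2. Under the hypothesis of Theorem 1, with the notation (13)
and (17),

$$|E_j| \le \Bigl[b_0 + \frac{\tau + |\alpha_0| K \rho^* + (1/h)\rho}{K(A + hK|\alpha_0|B)}\Bigr] \exp\,[(x_j - a) K (A + hK|\alpha_0|B)], \qquad a \le x_j \le b, \tag{19}$$

where

$$\rho \equiv \max_{s < j \le N} |\rho_j|, \qquad \rho^* \equiv \max_{s < j \le N} |\rho^*_j|, \qquad b_0 \equiv \max_{0 \le j \le s} |U_j - y(x_j)|. \qquad \blacksquare \tag{20}$$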


It is of interest to note that in the bound (19) the corrector roundoff
enters in the form $\rho/h$ while that from the predictor has a coefficient
independent of $h$. However, it is unlikely that any special measures in
actual computation could be adopted to balance these different orders in
$h$ of the roundoff. If in Theorem 2 we know that

$$b_0 = \mathcal{O}(h), \quad \rho^* = \mathcal{O}(h), \quad \tau = \mathcal{O}(h), \quad \text{and} \quad \rho = \mathcal{O}(h^2),$$

then (19) yields $|E_j| \le Vh$ for $a \le x_j \le b$ for a constant $V$ independent
of $h$.

It should be observed that if $\alpha_0 = 0$ in (7), the predictor is never used.
The corrector in this case is an open formula, and the above error analysis
then applies to the method based on the use of a single open formula. The
corresponding result for the method based on the use of a single closed
formula (i.e., the implicit method) is obtained by a slight modification of
the above technique (see Problem 1). Now if, in the predictor-corrector
method, more than one iteration is employed, the estimates (16) and
(19) no longer apply. But a comparison of the error bounds of the
predictor-corrector method and the corresponding implicit corrector method
shows that there is no great gain to be expected in using the corrector
more than once, provided $h\sigma^*_{i+1}$ and $\sigma_{i+1}$ are of the same order in $h$.

We can remark further that in Theorem 2 the requirement that $f(x, y)$
and $f_y(x, y)$ be bounded and continuous for $|y| < \infty$ can be replaced by
the milder restriction that $f(x, y)$ and $f_y(x, y)$ be bounded and continuous
in the strip

$$S': \{(x, y) \mid a \le x \le b,\ |y - y(x)| \le d\} \quad \text{for some } d > 0,$$

provided that:

(a) $h$ is sufficiently small,

(b) $b_0 = \mathcal{O}(h)$, $\rho^* = \mathcal{O}(h)$, $\tau = \mathcal{O}(h)$ and,

(c) $\rho = \mathcal{O}(h^2)$.

To show that the estimate (19) holds, we now could show inductively that
the values $U^*_{i+1}$ and $U_{i+1}$ exist and are in the strip $S'$, and therefore
(19) is satisfied for $j = i + 1$. The constant $K \equiv \max_{S'} |f_y|$ replaces the
previous definition of $K$ in (19).

2.2. Change of Net Spacing 

During the course of a computation based on a predictor-corrector 
method, we should keep track of the “measure of error,” 

$$\eta_{i+1} \equiv U^*_{i+1} - U_{i+1}.$$

That is,

$$|\eta_{i+1}| = |U_{i+1} - U^*_{i+1}| = |E_{i+1} - E^*_{i+1} + h\sigma^*_{i+1}| \le |E_{i+1}| + |E^*_{i+1}| + h|\sigma^*_{i+1}|. \tag{20}$$

Hence if $|\eta_{i+1}|$ is large, we know that the actual error is probably large.

An isolated mistake in computation may be responsible for a large
$|\eta_{i+1}|$. But, if the computation is correct, then the obvious way to reduce
$\eta_{i+1}$ is to reduce the interval size $h$. In practice, this is usually done by
successively halving $h$. Alternatively if the estimate (20) becomes very
small, $h$ may be increased, say, by doubling it.

Doubling the interval size offers no difficulty if at least $2s$ points have
been computed with the net spacing $h$. We merely discard the data at
every other net point, replace $h$ by $2h$, and continue the calculations.

On the other hand, in order to halve $h$, we require data at $s/2$ new
intermediate points, say

$$x_i - h/2,\ x_{i-1} - h/2,\ \ldots,\ x_{i+1-(s/2)} - h/2.$$

These values can be determined by the application of an interpolation
procedure which uses the known data at appropriate net points
$x_j = x_0 + jh$. However, the accuracy of the interpolation formula must be
consistent with that of the predictor-corrector formula being used. That
is, the interpolation error must be at least of the same order in $h$ as the
truncation error, $\tau$, given in (10). Otherwise, as Theorem 1 indicates, the
accuracy of the numerical solution will not be greater than that determined
by the interpolation. The caution required in order to reduce the net spacing
without sacrificing accuracy is one of the disadvantages of
predictor-corrector methods when compared to single-step methods of the next
section. Sometimes it is feasible to actually restart the integration at $x_i$,
by using the method employed at $x_0$, but with the net size $h/2$.
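The two changes of net spacing described above can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, and the use of Lagrange interpolation through the four most recent net points is an assumption made for the example. In practice the number of points must be chosen so that the interpolation error is of at least the same order in $h$ as the truncation error.

```python
def lagrange(xs, ys, x):
    """Value at x of the polynomial interpolating the points (xs[k], ys[k])."""
    total = 0.0
    for k, (xk, yk) in enumerate(zip(xs, ys)):
        weight = 1.0
        for m, xm in enumerate(xs):
            if m != k:
                weight *= (x - xm) / (xk - xm)
        total += weight * yk
    return total

def double_spacing(xs, us):
    """Keep every other net point, counting back from the most recent one."""
    return xs[::-1][::2][::-1], us[::-1][::2][::-1]

def halve_spacing(xs, us, points=4):
    """Insert interpolated values midway between the last `points` net points."""
    px, pu = xs[-points:], us[-points:]
    h = px[-1] - px[-2]
    new_x, new_u = [px[0]], [pu[0]]
    for k in range(1, len(px)):
        mid = px[k] - 0.5 * h
        new_x += [mid, px[k]]
        new_u += [lagrange(px, pu, mid), pu[k]]
    return new_x, new_u

# Data sampled from a cubic are reproduced exactly by 4-point interpolation.
xs = [0.1 * j for j in range(6)]
us = [x**3 for x in xs]
hx, hu = halve_spacing(xs, us)
dx, du = double_spacing(xs, us)
```

After `halve_spacing` the returned lists hold data on a net of spacing $h/2$ ending at the current point; after `double_spacing` they hold data on a net of spacing $2h$ ending at the current point.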
PROBLEMS, SECTION 2

1. Verify the entries in Table 1 labeled Modified Euler and Milne's method
(3 points).

2. Verify the entries in Table 1 labeled Improved Adams and Milne's Method
(5 points).

3. For each of the methods of Table 1, with what interval size $h$, and how
many decimal places should the equation

$$y' = y, \qquad y(0) = 1$$

be solved in $0 \le x \le 5$ in order that the error satisfy

$$|E_i| = |U_i - y_i| \le 10^{-4}?$$

4. Assume that $f(x, y)$ and $f_y(x, y)$ are bounded and continuous in

$$S: \{(x, y) \mid a \le x \le b,\ |y| < \infty\}.$$

Then if $h < 1/|\alpha_0 K|$ where $K \equiv \max |f_y|$, the implicit scheme (3a) can be
used to find the $\{u_i\}$, given $u_0, u_1, \ldots, u_r$, with $r = \max\,(n, p)$. [See discussion
after equation (6).] Estimate the total error, $E_i \equiv U_i - y_i$, in solving (1) by
the implicit scheme based on (3a).

[Hint: If we stop the iterative process described in (5), when the equation
(3a) is satisfied with the error $\rho_{i+1}$, we will obtain a sequence $\{U_i\}$ with $U_0 = u_0$,
$U_1 = u_1, \ldots, U_r = u_r$, that satisfies

$$U_{i+1} = U_{i-p} + h \sum_{j=0}^{n} \alpha_j f(x_{i+1-j}, U_{i+1-j}) + \rho_{i+1}.$$

Equation (8b) defines the corresponding truncation error $\sigma_{i+1}$. Show that $E_i$
satisfies

$$E_{i+1} = E_{i-p} + h \sum_{j=1}^{n} \alpha_j g_{i+1-j}\, E_{i+1-j} + h \alpha_0 g_{i+1}\, E_{i+1} + \rho_{i+1} - h\sigma_{i+1},$$

where $g_j \equiv f_y(x_j, \bar{y}_j)$ at some suitable point $\bar{y}_j$. Hence show that $|E_i| \le a_i$,
where the sequence $\{a_i\}$ is defined by

$$a_0 \equiv \max\,(|E_0|, \ldots, |E_r|),$$

$$a_{i+1} \equiv a_i \Bigl(\frac{1 + hAK}{1 - h|\alpha_0|K}\Bigr) + \Bigl(\frac{\rho + h\sigma}{1 - h|\alpha_0|K}\Bigr),$$

where $\rho \equiv \max |\rho_i|$, $\sigma \equiv \max |\sigma_i|$. Then show that

$$a_{i+1} = a_i (1 + hQ) + R,$$

where

$$Q \equiv \frac{K(A + |\alpha_0|)}{1 - h|\alpha_0|K}, \qquad R \equiv \frac{\rho + h\sigma}{1 - h|\alpha_0|K}.$$
The higher accuracy predictor-corrector methods of Section 2 all require
special procedures for starting the calculations. That is, some approximate
solution, $u_j$, must first be computed for $j = 0, 1, \ldots, s$. In addition, if the
interval size, $h$, is reduced during the course of the calculation, care must
be taken to preserve the accuracy of the method. The single-step methods,
which we now consider, require none of these special measures. In fact,
they can be used to determine the starting values and to change the net
spacing in other methods. The price paid for these advantages is, in general,
the requirement of a greater number of evaluations of the function $f(x, y)$
(or functions related to it) for each step in the solution.

Again we consider the initial value problem

$$\frac{dy}{dx} = f(x, y), \qquad y(a) = y_0. \tag{1}$$

By single-step we mean that only data at $x = x_0$ are to be employed in
obtaining the approximation to $y(x)$ at $x = x_1$. Obviously such a procedure
could then be employed at $x_1$, and so forth, to extend the solution with
arbitrary step sizes. However, for convenience in exposition we shall
consider calculations on a uniform net

$$x_j = a + jh, \qquad h = \frac{b - a}{N}.$$

Any single-step method for approximating the solution of (1) in $[a, b]$
can be indicated by the general form

$$u_0 = y_0 + e_0, \tag{2a}$$

$$u_{j+1} = u_j + hF\{h, x_j, u_j; f\}, \qquad j = 0, 1, \ldots, N - 1. \tag{2b}$$

Here we denote by $F\{h, x_j, u_j; f\}$ some quantity whose value is uniquely
determined by the value of $(h, x_j, u_j)$ and the function $f$. For example,
the Euler-Cauchy scheme in (1.1) is a single-step method in which
$F\{h, x, u; f\} \equiv f(x, u)$. We shall see that a variety of different choices
for $F$ is determined by using Taylor's theorem or quadrature formulae.
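The general form (2) translates almost directly into a small driver routine. The sketch below is illustrative only; the names `one_step_solve` and `euler_F` are hypothetical, and the increment function is passed in as an ordinary Python callable with the argument order $(h, x, u, f)$.

```python
def one_step_solve(F, f, a, b, y0, N, e0=0.0):
    """Apply u_{j+1} = u_j + h F{h, x_j, u_j; f} on the uniform net x_j = a + j h."""
    h = (b - a) / N
    u = [y0 + e0]                                   # (2a), e0 is the initial error
    for j in range(N):
        xj = a + j * h
        u.append(u[j] + h * F(h, xj, u[j], f))      # (2b)
    return u

def euler_F(h, x, u, f):
    """Euler-Cauchy choice: F{h, x, u; f} = f(x, u)."""
    return f(x, u)

# y' = y, y(0) = 1 on [0, 1] with N = 4: Euler gives u_N = (1 + h)^N = 1.25^4.
u = one_step_solve(euler_F, lambda x, y: y, 0.0, 1.0, 1.0, 4)
```

Any other single-step scheme is obtained by replacing `euler_F` with another increment function of the same signature.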

It is a simple matter to obtain estimates for the error in a very general
class of single-step methods. To do this we first define the local truncation
errors, $\tau_{j+1}$, by writing

$$y(x_{j+1}) = y(x_j) + hF\{h, x_j, y(x_j); f\} + h\tau_{j+1}, \qquad j = 0, 1, \ldots, N - 1; \tag{3}$$

where $y(x)$ is the solution of (1). The largest integer $p$ such that $|\tau_j| =
\mathcal{O}(h^p)$ is called the order† of the method. As usual, the errors in the
numerical solution are defined by

$$e_j \equiv u_j - y(x_j), \qquad j = 0, 1, \ldots, N. \tag{4}$$

Now in analogy with Theorem 1.1 we have

THEOREM 1. Let $u_j$ be the numerical solution defined in (2) where
$F\{h, x, u; f\}$ satisfies

$$|F\{h, x, u; f\} - F\{h, x, v; f\}| \le K|u - v| \tag{5}$$

for all $(x, u)$ and $(x, v)$ in the strip $S: \{(x, y) \mid a \le x \le b,\ |y| < \infty\}$.
Then if $y(x)$ is the solution of (1), and $\{\tau_j\}$ is defined by (3),

$$|u_j - y(x_j)| \le e^{K(x_j - a)}\Bigl(|e_0| + \frac{\tau}{K}\Bigr), \qquad j = 0, 1, \ldots, N; \tag{6}$$

where

$$\tau \equiv \max_{j} |\tau_j|.$$

Proof. Subtract (3) from (2b) and use (5) to find

$$\begin{aligned}
|e_{j+1}| &= \bigl|\, e_j + h\,[F\{h, x_j, u_j; f\} - F\{h, x_j, y(x_j); f\}] - h\tau_{j+1} \bigr| \\
&\le (1 + hK)|e_j| + h\tau, \qquad j = 0, 1, \ldots, N - 1.
\end{aligned}$$

The remainder of the proof follows exactly as in Theorem 1.1. ■

If the function $f(x, y)$ and the scheme, determined by $F\{h, x, u; f\}$,
have special smoothness properties it may be possible to replace (5) by a
type of mean value equality, say,

$$F\{h, x, u; f\} - F\{h, x, v; f\} = G\{h, x, u, v; f\}(u - v). \tag{7}$$

Here $G$ is determined by the value of $(h, x, u, v)$ and the function $f$.
Again in the Euler-Cauchy scheme if $\partial f/\partial y$ is continuous, then (7) holds
with

$$G\{h, x, u, v; f\} \equiv \frac{\partial f(x, \theta u + (1 - \theta)v)}{\partial y}$$

for some $\theta$ in $0 < \theta < 1$.

When this mean value property is satisfied, we can prove exact analogs
of Theorem 1.2 and its corollaries.

† We assume that $f(x, y)$ has enough continuous derivatives so that $p$ may be
determined by a Taylor's series expansion of $y(x_j + h) - y(x_j) - hF\{h, x_j, y(x_j); f\}$ in
powers of $h$.

The roundoff errors in single-step methods can be treated very much as
in Subsection 1.2. Thus the numbers actually obtained, say $\{U_j\}$, in trying
to evaluate the set $\{u_j\}$ from (2) will, in general, have errors due to the
finite precision arithmetic. These numbers will satisfy equations of the
form

$$U_0 = y_0 + \rho_0, \tag{8a}$$

$$U_{j+1} = U_j + hF\{h, x_j, U_j; f\} + \rho_{j+1}, \qquad j = 0, 1, \ldots, N - 1. \tag{8b}$$

Then, if (5) is satisfied, we deduce in an obvious manner

$$|U_j - y(x_j)| \le e^{K(x_j - a)}\Bigl(|\rho_0| + \frac{\tau + \rho/h}{K}\Bigr), \qquad j = 0, 1, \ldots, N, \tag{9}$$

where $\rho \equiv \max_{1 \le j} |\rho_j|$. Again we see that as $h \to 0$, while $x_j - a = jh = c$
is fixed, the roundoff error may become arbitrarily large if the computing
accuracy remains unchanged. This effect is due to the fact that infinitely
many computations are required to get to the finite point $x = c$ as $h \to 0$.
If the single-step scheme is of order $p$, then $\tau$ can be bounded by a term
of the form $Mh^p$. For numerical balance then, $|\rho_0| = \mathcal{O}(h^p)$ and $\rho = \mathcal{O}(h^{p+1})$
are reasonable requirements for the magnitude of the rounding error.

3.1. Finite Taylor’s Series 

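If the solution of (1) has continuous derivatives of order $r + 1$ in
$[a, b]$, then by Taylor's theorem:

$$y(x_{j+1}) = y(x_j) + h y^{(1)}(x_j) + \cdots + \frac{h^r}{r!} y^{(r)}(x_j) + \frac{h^{r+1}}{(r+1)!} y^{(r+1)}(x_j + \theta_j h), \tag{10a}$$

$$0 < \theta_j < 1; \qquad j = 0, 1, \ldots, N - 1.$$

From the differential equation it follows that the higher order derivatives
of $y(x)$ can be expressed as

$$\begin{aligned}
y^{(1)}(x) &= f(x, y), \\
y^{(2)}(x) &= f_x(x, y) + f_y(x, y)\, y^{(1)}(x), \\
y^{(3)}(x) &= f_{xx}(x, y) + 2 f_{xy}(x, y)\, y^{(1)}(x) + f_{yy}(x, y)\,[y^{(1)}(x)]^2 + f_y(x, y)\, y^{(2)}(x),
\end{aligned} \tag{10b}$$

or in general,

$$y^{(\nu)}(x) = \Bigl[\Bigl(\frac{\partial}{\partial x} + f(x, y)\frac{\partial}{\partial y}\Bigr)^{\nu - 1} f(x, y)\Bigr]_{y = y(x)}; \qquad \nu = 1, 2, \ldots. \tag{10b'}$$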
Thus given the value of $y(x)$ at a point, we may determine its derivatives
if we can evaluate the partial derivatives of $f(x, y)$. We use these
observations in the finite Taylor's series method for approximating solutions of (1).
Equation (10a) suggests the scheme

$$u_{j+1} = u_j + h u_j^{(1)} + \cdots + \frac{h^r}{r!} u_j^{(r)}, \qquad j = 0, 1, \ldots, N - 1, \tag{11a}$$

where in analogy with (10b) we have defined, for given $(x_j, u_j)$,

$$\begin{aligned}
u_j^{(1)} &\equiv f(x_j, u_j), \\
u_j^{(2)} &\equiv f_x(x_j, u_j) + f_y(x_j, u_j)\, u_j^{(1)}, \\
u_j^{(3)} &\equiv f_{xx}(x_j, u_j) + 2 f_{xy}(x_j, u_j)\, u_j^{(1)} + f_{yy}(x_j, u_j)\,(u_j^{(1)})^2 + f_y(x_j, u_j)\, u_j^{(2)}.
\end{aligned} \tag{11b}$$

These formulae are easily deduced from the compact symbolic formula
obtained from (10b')

$$u_j^{(\nu)} \equiv \Bigl[\Bigl(\frac{\partial}{\partial x} + f(x, y)\frac{\partial}{\partial y}\Bigr)^{\nu - 1} f(x, y)\Bigr]_{x = x_j,\ y = u_j}. \tag{11b'}$$

The initial value is, allowing for an error in obtaining $y_0$,

$$u_0 = y_0 + e_0. \tag{11c}$$

The formulation of the method is complete and the approximation $\{u_j\}$ can
be computed by recursive application of (11a) through (11c).
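As an illustration of (11a) through (11c) with $r = 3$, the following sketch requires the user to supply $f$ and its first and second partial derivatives as separate callables. The function names and the test problem $y' = x - y$, $y(0) = 1$ (exact solution $y = x - 1 + 2e^{-x}$) are assumptions chosen for the example; they are not taken from the text.

```python
import math

def taylor3_step(x, u, h, f, fx, fy, fxx, fxy, fyy):
    """One step of (11a) with r = 3, using the derivative formulae (11b)."""
    u1 = f(x, u)
    u2 = fx(x, u) + fy(x, u) * u1
    u3 = fxx(x, u) + 2.0 * fxy(x, u) * u1 + fyy(x, u) * u1**2 + fy(x, u) * u2
    return u + h * u1 + h**2 / 2.0 * u2 + h**3 / 6.0 * u3

def taylor3_solve(a, b, y0, N, f, fx, fy, fxx, fxy, fyy):
    """Recursive application of the step on a uniform net, with e_0 = 0."""
    h = (b - a) / N
    u = y0
    for j in range(N):
        u = taylor3_step(a + j * h, u, h, f, fx, fy, fxx, fxy, fyy)
    return u

# Assumed test problem: f(x, y) = x - y, whose partial derivatives are constants.
zero = lambda x, y: 0.0
problem = dict(f=lambda x, y: x - y, fx=lambda x, y: 1.0, fy=lambda x, y: -1.0,
               fxx=zero, fxy=zero, fyy=zero)

def error_at_one(N):
    exact = 1.0 - 1.0 + 2.0 * math.exp(-1.0)   # y(1) = x - 1 + 2 e^{-x} at x = 1
    return abs(taylor3_solve(0.0, 1.0, 1.0, N, **problem) - exact)

# Halving h should divide the error by roughly 2^3 for a third order scheme.
ratio = error_at_one(10) / error_at_one(20)
```

The need to code six separate derivative functions even for $r = 3$ shows why the number of function evaluations grows quickly with the order.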

To write the Taylor's series method in the form (2) we need only define
the operator

$$F\{h, x_j, u_j; f\} \equiv u_j^{(1)} + \frac{h}{2!} u_j^{(2)} + \cdots + \frac{h^{r-1}}{r!} u_j^{(r)}, \tag{12}$$

where the $u_j^{(\nu)}$ are defined in (11). Then from the expansion (10a) and the
definition (3) of the truncation error for a one-step method, we obtain

$$\tau_{j+1} = \frac{h^r}{(r+1)!}\, y^{(r+1)}(x_j + \theta_j h), \qquad 0 < \theta_j < 1, \quad j = 0, 1, \ldots, N - 1. \tag{13}$$

The method is thus of order $r$ when the first $r + 1$ terms in the Taylor
expansion are used. To verify condition (5) we use Taylor's theorem in
(11b), after eliminating all $u^{(\nu)}$ and $v^{(\nu)}$, to get

$$\begin{aligned}
u^{(1)} - v^{(1)} &= (u - v)[f_y], \\
u^{(2)} - v^{(2)} &= (u - v)[f_{xy} + f_y^2 + f_{yy} f].
\end{aligned}$$

The arguments of $f(x, y)$ and its derivatives which occur in the brackets
are all of the form $(x, \theta u + (1 - \theta)v)$ with different values of $\theta$, in
$0 < \theta < 1$, in different brackets. From the representation (11b') we find
in a straightforward manner that

$$u^{(\nu)} - v^{(\nu)} = (u - v)\Bigl[\frac{\partial}{\partial y}\Bigl\{\Bigl(\frac{\partial}{\partial x} + f(x, y)\frac{\partial}{\partial y}\Bigr)^{\nu - 1} f(x, y)\Bigr\}\Bigr], \qquad \nu = 1, 2, \ldots. \tag{14}$$

Hence we can conclude that if $f(x, y)$ has sufficiently many continuous
and bounded partial derivatives for $(x, y) \in S$ then $F\{h, x, u; f\}$ defined
in (12) satisfies (7). Thus (5) is also satisfied, say, for all $h \le h_0$, where $h_0$
is some fixed spacing. The constant $K$ entering into (5) can be written in
the form

$$K \equiv M_1 + \frac{h_0}{2!} M_2 + \cdots + \frac{h_0^{r-1}}{r!} M_r, \tag{15}$$

where $M_k$ is a bound on the appropriate bracket in the $k$th equation in
(14); i.e.,

$$M_k \equiv \sup_{S} \Bigl|\frac{\partial}{\partial y}\Bigl(\frac{\partial}{\partial x} + f(x, y)\frac{\partial}{\partial y}\Bigr)^{k - 1} f(x, y)\Bigr|, \qquad k = 1, 2, \ldots, r.$$

By applying Theorem 1 and equation (13), we find that for all $h \le h_0$

$$|u_j - y(x_j)| \le e^{K(x_j - a)}\Bigl[|e_0| + \frac{h^r M_{r+1}}{(r+1)!\,K}\Bigr], \qquad j = 0, 1, \ldots, N, \tag{16}$$

where

$$M_{r+1} \equiv \sup_{[a, b]} |y^{(r+1)}(x)|.$$

If we neglect the initial error, i.e., set $e_0 = 0$, then the error is at most
$\mathcal{O}(h^r)$. Thus the Taylor series method can be used to generate starting data
which is consistent with any order predictor-corrector method provided
only that $f(x, y)$ is sufficiently smooth. However, many different function
evaluations, as in (11b), are required and so this method is not very
efficient.

Let us assume that

$$-C^2 \le f_y(x, y) \le -B^2 < 0 \quad \text{for all } x \in [a, b] \text{ and } |y| < \infty.$$

We show that in this case the bound (16) can be improved upon. Pick $h_0$
such that

$$K' \equiv -B^2 + \frac{h_0}{2!} M_2 + \cdots + \frac{h_0^{r-1}}{r!} M_r < 0 \quad \text{and} \quad 1 + h_0 K'' > 0,$$

where

$$K'' \equiv -C^2 - B^2 - K'.$$


|j/ = B V U +(1 ~6 v )V 
Now we find, by retracing the proof of Theorem 1 with a little care, that
for all $h \le h_0$

(17) | u, - y( Xi )\ < ^-“)[kol + (r TWWA ' j = 0 ’ 1 ’-- ■’ K 

We note that the exponential here is a decreasing function of $x_j$.

3.2. One-Step Methods Based on Quadrature Formulae 

By integrating the differential equation (1) over $[x_j, x_{j+1}]$ we get

$$y(x_{j+1}) = y(x_j) + \int_{x_j}^{x_{j+1}} f(x, y(x))\, dx. \tag{18a}$$

Hence, we see that various forms for $hF\{h, x, u; f\}$ are naturally suggested
by quadrature formulae. However, as we are considering one-step methods
the appropriate quadrature formulae should only employ nodes in
$[x_j, x_{j+1}]$, say, for example, the $n + 1$ points satisfying

$$x_j \le \xi_0 \le \xi_1 \le \cdots \le \xi_n \le x_{j+1}.$$

But the integrand or an approximation to it must be known at these nodes
and so we require approximations to $y(\xi_\nu)$, $\nu = 0, 1, \ldots, n$. We have for
the exact solution at these points,

$$y(\xi_\nu) = y(x_j) + \int_{x_j}^{\xi_\nu} f(x, y(x))\, dx, \qquad \nu = 0, 1, \ldots, n. \tag{18b}$$

Thus we could use a sequence of quadrature formulae to estimate
successively the values $y(\xi_\nu)$ and ultimately $y(x_{j+1})$.

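A general class of one-step methods based upon these observations is
given by using (2) with

$$hF\{h, x_j, u_j; f\} \equiv h \sum_{\nu=0}^{n} \alpha_\nu f(\xi_\nu, \eta_\nu); \tag{19a}$$

$$\xi_0 \equiv x_j; \qquad \xi_\nu \equiv \xi_0 + h\theta_\nu, \qquad \nu = 1, 2, \ldots, n; \tag{19b}$$

$$\eta_0 \equiv u_j; \qquad \eta_\nu \equiv \eta_0 + h \sum_{k=0}^{\nu-1} \alpha_{\nu k} f(\xi_k, \eta_k), \qquad \nu = 1, 2, \ldots, n. \tag{19c}$$

If in (19) we regard $\eta_\nu$ as an approximation to $y(\xi_\nu)$, then the sums in (19a)
and (19c) can be regarded as approximations to the integrals, respectively,
in (18a) and (18b). In fact, these considerations suggest that we require

$$0 = \theta_0 \le \theta_1 \le \theta_2 \le \cdots \le \theta_n \le 1, \tag{20a}$$

$$\sum_{\nu=0}^{n} \alpha_\nu = 1; \tag{20b}$$

$$\sum_{k=0}^{\nu-1} \alpha_{\nu k} = \theta_\nu, \qquad \nu = 1, 2, \ldots, n. \tag{20c}$$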
Condition (20a) requires that all the nodes, $\xi_\nu$, lie in $[x_j, x_{j+1}]$ and form a
non-decreasing set; condition (20b) implies that the sum in (19a) has
degree of precision at least 0 as a quadrature formula over $[\xi_0, \xi_n]$;
conditions (20c) imply similar results for the sums in (19c) as quadrature
formulae over $[\xi_0, \xi_\nu]$. If, in addition to (20b), we require that

$$\sum_{\nu=0}^{n} \alpha_\nu (\theta_\nu)^p = \frac{1}{p + 1}, \qquad p = 1, 2, \ldots, m, \tag{21}$$

then the basic quadrature scheme in (19a) has degree of precision at
least $m$.
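Conditions (20b), (20c), and (21) are easy to check mechanically for a given parameter set. The sketch below does so in exact rational arithmetic; the helper names are hypothetical, and the parameters used are those of the classical fourth-order Runge-Kutta scheme ($\alpha_\nu = 1/6, 1/3, 1/3, 1/6$; $\theta_\nu = 0, 1/2, 1/2, 1$), taken here as a standard example rather than derived.

```python
from fractions import Fraction as Fr

def check_20b(alpha):
    """(20b): the weights alpha_nu sum to one."""
    return sum(alpha) == 1

def check_20c(theta, a):
    """(20c): for nu >= 1 the row sum of the alpha_{nu k} equals theta_nu."""
    return all(sum(a[v]) == theta[v] for v in range(1, len(theta)))

def degree_of_precision(alpha, theta, max_p=10):
    """Largest m such that (21) holds for p = 1, ..., m (0 if it fails at p = 1)."""
    m = 0
    for p in range(1, max_p + 1):
        if sum(al * th**p for al, th in zip(alpha, theta)) != Fr(1, p + 1):
            break
        m = p
    return m

# Classical Runge-Kutta parameters, n = 3 (row nu of `a` holds alpha_{nu k}, k < nu).
alpha = [Fr(1, 6), Fr(1, 3), Fr(1, 3), Fr(1, 6)]
theta = [Fr(0), Fr(1, 2), Fr(1, 2), Fr(1)]
a = [[], [Fr(1, 2)], [Fr(0), Fr(1, 2)], [Fr(0), Fr(0), Fr(1)]]

m = degree_of_precision(alpha, theta)
```

For these weights and nodes the basic quadrature in (19a) reduces to Simpson's rule, which is consistent with the computed degree of precision 3.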

These considerations suggest many choices for the parameters in (19),
some of which have been examined in the literature (e.g., Gaussian
quadrature, equal coefficient formulae, etc.). In fact, in practice, the parameters
are determined by the reasonable requirement that the local truncation
error, for a fixed choice of $n$, be of as high an order in $h$ as possible. From
(3), we determine the local truncation error, $\tau_{j+1}$, for the one-step method
defined by (19) from

$$y(x_j + h) = y(x_j) + h \sum_{\nu=0}^{n} \alpha_\nu f(\xi_\nu, y_{\nu j}) + h\tau_{j+1}, \tag{22a}$$

where

$$y_{0j} \equiv y(x_j); \qquad y_{\nu j} \equiv y_{0j} + h \sum_{k=0}^{\nu-1} \alpha_{\nu k} f(\xi_k, y_{kj}), \qquad \nu = 1, 2, \ldots, n. \tag{22b}$$

If the parameters are given and $f(x, y)$ has sufficiently many continuous
derivatives, then $y(x_j + h)$ and the $f(\xi_k, y_{kj})$ can be expanded in powers of
$h$ [about $(x_j, y(x_j))$]. Equation (22) then yields, upon equating coefficients
of like powers of $h$ in (22a), an expression for $\tau_{j+1}$. Obviously, this
procedure can be used to determine the parameters in (19) such that $\tau_{j+1}$ has
the highest possible order in $h$. The use of Taylor's theorem here is similar
to its use in Section 5 of Chapter 5 to determine high order approximations
to derivatives, but is now much more complicated. We do not repeat here
any of these lengthy calculations, but present in Table 2 some sets of
parameter values for one-step methods of indicated order. It is found, in fact,
that for $n = 0$ and 1 the maximum orders are 1 and 2, respectively, and
the conditions imposed are just those in (20) and (21) with $m = 1$ or 2.
For $n = 2$ an order of 3 can be obtained, if, in addition to (20) and (21)
with $m = 3$ one additional relation is satisfied; namely,

$$\sum_{\nu=0}^{n} \alpha_\nu \sum_{k=0}^{\nu-1} \alpha_{\nu k} \theta_k = \frac{1}{6}.$$

(This relation can be explained as the result of requiring the coefficients
$\alpha_\nu$ and $\alpha_{\nu k}$ to come from quadrature formulae with respective degrees of
precision 2 and 1, at least.)
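A direct implementation of the increment function defined by (19) makes it possible to test a parameter set both algebraically and numerically. The sketch below is illustrative: the function names are hypothetical, the parameters are those of Kutta's third-order scheme ($\alpha_\nu = 1/6, 2/3, 1/6$; $\theta_\nu = 0, 1/2, 1$; $\alpha_{10} = 1/2$; $\alpha_{20} = -1$, $\alpha_{21} = 2$), and the test equation $y' = -y$, $y(0) = 1$ is an assumption made for the example.

```python
from fractions import Fraction as Fr
import math

def rk_F(h, x, u, f, alpha, theta, a):
    """Increment function F of (19a) for given weights, nodes and alpha_{nu k}."""
    xi = [x + h * t for t in theta]                        # nodes (19b)
    eta = [u]                                              # eta_0 = u_j
    for v in range(1, len(alpha)):                         # stages (19c)
        eta.append(u + h * sum(a[v][k] * f(xi[k], eta[k]) for k in range(v)))
    return sum(alpha[v] * f(xi[v], eta[v]) for v in range(len(alpha)))

def third_order_relation(alpha, theta, a):
    """Left side of the additional relation required for order 3 when n = 2."""
    return sum(alpha[v] * sum(a[v][k] * theta[k] for k in range(v))
               for v in range(len(alpha)))

# Kutta's third-order parameters (row nu of `a` holds alpha_{nu k}, k < nu).
alpha = [Fr(1, 6), Fr(2, 3), Fr(1, 6)]
theta = [Fr(0), Fr(1, 2), Fr(1)]
a = [[], [Fr(1, 2)], [Fr(-1), Fr(2)]]
relation = third_order_relation(alpha, theta, a)

def error_at_one(N):
    """Error at x = 1 for y' = -y, y(0) = 1, using floating point parameters."""
    al = [float(c) for c in alpha]
    th = [float(c) for c in theta]
    aa = [[float(c) for c in row] for row in a]
    h, u = 1.0 / N, 1.0
    for j in range(N):
        u += h * rk_F(h, j * h, u, lambda x, y: -y, al, th, aa)
    return abs(u - math.exp(-1.0))

# Halving h should divide the error by roughly 2^3 for a third order scheme.
ratio = error_at_one(10) / error_at_one(20)
```

The exact rational check confirms the additional relation, while the error ratio gives an empirical estimate of the order.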
Table 2   Some Standard Single-Step Difference Methods

Associated              Coefficients            ν or j                 Order in
Name             n      and Nodes        0      1      2      3        h of τ
-------------------------------------------------------------------------------
Modified Euler   1      α_ν   =         1/2    1/2                     O(h^2)
                        θ_ν   =          0      1
                        α_1j  =          1

Heun             2      α_ν   =         1/4     0     3/4              O(h^3)
                        θ_ν   =          0     1/3    2/3
                        α_1j  =         1/3
                        α_2j  =          0     2/3

Kutta            2      α_ν   =         1/6    2/3    1/6              O(h^3)
                        θ_ν   =          0     1/2     1
                        α_1j  =         1/2
                        α_2j  =         -1      2

Runge-Kutta      3      α_ν   =         1/6    1/3    1/3    1/6       O(h^4)
                        θ_ν   =          0     1/2    1/2     1
                        α_1j  =         1/2
                        α_2j  =          0     1/2
                        α_3j  =          0      0      1

Runge-Kutta      3      α_ν   =         1/8    3/8    3/8    1/8       O(h^4)
                        θ_ν   =          0     1/3    2/3     1
                        α_1j  =         1/3
                        α_2j  =        -1/3     1
                        α_3j  =          1     -1      1

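We shall now show that all the schemes included in (19) satisfy the mean
value property (7), if $f_y(x, y)$ is continuous in $S$. Let the quantities $\zeta_\nu$
be defined as the $\eta_\nu$ are in (19) but with $u_j$ replaced by $v_j$. Now we introduce
the notation $g(x, y) \equiv f_y(x, y)$ and use the continuity of $g$ to deduce

$$\begin{aligned}
(\eta_\nu - \zeta_\nu) &= (\eta_0 - \zeta_0) + h \sum_{k=0}^{\nu-1} \alpha_{\nu k}\,[f(\xi_k, \eta_k) - f(\xi_k, \zeta_k)] \\
&= (\eta_0 - \zeta_0) + h \sum_{k=0}^{\nu-1} \alpha_{\nu k}\, g_k (\eta_k - \zeta_k); \qquad \nu = 1, 2, \ldots, n.
\end{aligned} \tag{23a}$$

Here we have used

$$g_k \equiv g(\xi_k, \phi_k \eta_k + (1 - \phi_k)\zeta_k), \qquad 0 < \phi_k < 1, \quad k = 0, 1, \ldots, n. \tag{23b}$$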
By applying the equations in (23) recursively, we can determine expressions
for the $(\eta_\nu - \zeta_\nu)$ in terms of $(\eta_0 - \zeta_0)$. However, this procedure is rather
complicated and so we will just present the result and verify it by induction.
Let us define the quantities $B_{\nu j}$ as follows: $B_{00} \equiv 1$;

$$\begin{aligned}
B_{\nu j} &\equiv 1, && j = 0, \\
B_{\nu j} &\equiv \sum_{k=j-1}^{\nu-1} \alpha_{\nu k}\, g_k B_{k, j-1}, && j = 1, 2, \ldots, \nu, \\
B_{\nu j} &\equiv 0, && j \ge \nu + 1,
\end{aligned} \qquad \nu = 1, 2, \ldots, n. \tag{24}$$

Then we have

$$(\eta_\nu - \zeta_\nu) = (\eta_0 - \zeta_0)(B_{\nu 0} + hB_{\nu 1} + h^2 B_{\nu 2} + \cdots + h^\nu B_{\nu\nu}), \qquad \nu = 1, 2, \ldots, n. \tag{25}$$

To verify (25) by induction, we note that from (23) and (24) with $\nu = 1$,

$$(\eta_1 - \zeta_1) = (\eta_0 - \zeta_0)(1 + h\alpha_{10} g_0) = (\eta_0 - \zeta_0)(B_{10} + hB_{11}).$$

Thus (25) is valid for $\nu = 1$. We now assume (25) to be valid up to $\nu - 1$
and use it in (23a) to obtain

$$\begin{aligned}
(\eta_\nu - \zeta_\nu) &= (\eta_0 - \zeta_0)\Bigl(1 + h \sum_{k=0}^{\nu-1} \alpha_{\nu k}\, g_k \sum_{m=0}^{k} h^m B_{km}\Bigr) \\
&= (\eta_0 - \zeta_0)\Bigl(1 + \sum_{m=0}^{\nu-1} h^{m+1} B_{\nu, m+1}\Bigr).
\end{aligned}$$

The induction is thus concluded and (25) is established.
We now obtain, from the mean value theorem and (25),

$$\begin{aligned}
F\{h, x_j, u_j; f\} - F\{h, x_j, v_j; f\} &= \sum_{\nu=0}^{n} \alpha_\nu g_\nu (\eta_\nu - \zeta_\nu) \\
&= (u_j - v_j) \sum_{\nu=0}^{n} \alpha_\nu g_\nu \sum_{k=0}^{\nu} h^k B_{\nu k},
\end{aligned}$$

since $(\eta_0 - \zeta_0) = (u_j - v_j)$. That is, we have established

LEMMA 1. Under the assumption that $f_y(x, y)$ is continuous every one-step
scheme defined by (19) satisfies the generalized mean value property (7)
with

$$G\{h, x, u, v; f\} \equiv \sum_{\nu=0}^{n} \Bigl(\alpha_\nu g_\nu \sum_{k=0}^{\nu} h^k B_{\nu k}\Bigr). \tag{26}$$

Here the $B_{\nu k}$ are defined in (24) and the $g_\nu$ and $g_k$ are values of $g(x, y) \equiv
f_y(x, y)$ at appropriate points of the form $(x, \phi u + (1 - \phi)v)$, $0 < \phi < 1$.
Again it should be observed that if $f_y < 0$ in $S$ and the $\alpha_\nu > 0$, then by
taking $h$ sufficiently small, (26) implies $G < 0$. Hence in such cases, the initial
error in one-step methods will decay exponentially with distance.

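Of course, the indicated class of single-step methods satisfies a Lipschitz
condition of the form (5). To obtain a suitable constant $K_0$ we could use,
from (26),

$$K_0 \equiv \sup_{S,\ h \le h_0} \Bigl|\sum_{\nu=0}^{n} \alpha_\nu g_\nu \sum_{k=0}^{\nu} h^k B_{\nu k}\Bigr|.$$

However, this is not readily calculable and so we shall determine an upper
bound for it in terms of the parameters of the scheme and

$$M \equiv \sup_{S} \Bigl|\frac{\partial f(x, y)}{\partial y}\Bigr|.$$

We note that $|g_\nu| \le M$ for all $\nu$. Now define

$$\Theta \equiv \max_{\nu} \Bigl(\sum_{k=0}^{\nu-1} |\alpha_{\nu k}|\Bigr) \quad \text{and} \quad \beta_j \equiv \max_{\nu} |B_{\nu j}|,$$

and from (24) with $j$ in $1 \le j \le \nu$

$$\begin{aligned}
|B_{\nu j}| &\le M \sum_{k=j-1}^{\nu-1} |\alpha_{\nu k}|\,|B_{k, j-1}| \\
&\le M \cdot \max_{j-1 \le k \le \nu-1} |B_{k, j-1}| \cdot \sum_{k=j-1}^{\nu-1} |\alpha_{\nu k}| \\
&\le M\Theta\beta_{j-1}.
\end{aligned}$$

Since the right-hand side is independent of $\nu$ we conclude that

$$\beta_j \le M\Theta\beta_{j-1},$$

and by recursion using $\beta_0 = 1$

$$\beta_j \le (M\Theta)^j, \qquad j = 0, 1, \ldots.$$

Since $|B_{\nu j}| \le \beta_j$ for all $\nu$ we have

$$\begin{aligned}
K_0 &\le \sum_{\nu=0}^{n} |\alpha_\nu|\, M \Bigl[1 + \sum_{k=1}^{\nu} (h\Theta M)^k\Bigr] \\
&\le M \sum_{\nu=0}^{n} |\alpha_\nu| \cdot \frac{1 - (h\Theta M)^{n+1}}{1 - (h\Theta M)}.
\end{aligned} \tag{27}$$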
If we require that the coefficients $\alpha_\nu$ be non-negative and satisfy at least
(20b), then

$$\sum_{\nu=0}^{n} |\alpha_\nu| = \sum_{\nu=0}^{n} \alpha_\nu = 1.$$

If further, $h$ is chosen such that $h \le h_0$, where $h_0 \Theta M < 1$, then the above
bound simplifies to

$$K_0 \le \frac{M}{1 - h_0 \Theta M}. \tag{28}$$

We also note that if the $\alpha_{\nu k}$ are non-negative for all $\nu$ and all $k = 0, 1, \ldots,
\nu - 1$ and if (20) is satisfied, then $\Theta = \theta_n$ and hence $0 < \Theta \le 1$. For
sufficiently small $h$, in any event, the above bound can be made as close as we
please to $M$ which serves as the Lipschitz constant in the simple Euler
method treated in Section 1.


PROBLEMS, SECTION 3 

1. Verify the entries in Table 2 under the name Modified Euler. 

2. Verify the entries in Table 2 under the names Heun and Kutta with $n = 2$.

3. Verify the entries in Table 2 for both schemes under the name
Runge-Kutta with $n = 3$.


4. LINEAR DIFFERENCE EQUATIONS 

We recall that linear difference equations with constant coefficients have
appeared previously in our study, for example, in Section 4 of Chapter 3
and in Subsection 1.4 of the present chapter. The theory of such difference
equations will be sketched here because it will be used in the general
treatment of difference methods given in the next section. The general linear
difference equation with constant coefficients is a relation of the form

$$L(u_j) \equiv \sum_{s=0}^{n} a_s u_{j+s} = c_{j+n}, \qquad j = j_0, j_0 + 1, \ldots. \tag{1}$$

Here the quantities $a_s$ are the coefficients, the $c_{j+n}$ are the inhomogeneous
terms, and the sequence $\{u_j\}$ is to be determined subject in general to
additional conditions. Usually the sequence is desired starting from some initial
index, say as indicated in (1) for $j \ge j_0$. The difference equation in (1)
is said to be of order $n$, if $a_n a_0 \ne 0$, since then the indices on the $u_j$ vary
over $n + 1$ consecutive integers. We shall see that a solution of an $n$th
order linear difference equation is, in general, determined by specifying $n$
initial conditions. That is, if $j_0$ is the initial index then we adjoin to (1)
the conditions

$$u_{j_0} = v_0, \quad u_{j_0+1} = v_1, \quad \ldots, \quad u_{j_0+n-1} = v_{n-1}. \tag{2}$$

We now have

THEOREM 1. If the difference equation (1) is of $n$th order then there is one
and only one solution $\{u_j\}$ satisfying the initial conditions (2).

Proof. The existence of the solution follows trivially since $a_n \ne 0$
implies from (1) that

$$u_{j+n} = -\frac{1}{a_n} \Bigl(\sum_{s=0}^{n-1} a_s u_{j+s}\Bigr) + \frac{c_{j+n}}{a_n}, \qquad j = j_0, j_0 + 1, \ldots. \tag{3}$$

For uniqueness let there be two solutions, $\{u'_j\}$ and $\{u''_j\}$. Then their
difference $\{w_j\} \equiv \{u'_j - u''_j\}$ satisfies (2) with $v_0 = v_1 = \cdots = v_{n-1} = 0$
and (3) with $c_{j+n} = 0$. Thus we find that $\{w_j\} \equiv \{0\}$ and the proof is
complete. ■
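The existence argument is constructive: (3) generates the solution term by term from the initial conditions (2). A minimal sketch follows, in which the function name and calling convention are hypothetical; the coefficients are passed as the list $[a_0, \ldots, a_n]$ and the inhomogeneous terms as a function returning $c_j$.

```python
def solve_difference(a, c, v, j0, count):
    """Return [u_{j0}, ..., u_{j0+count-1}] for sum_s a[s] u_{j+s} = c(j+n).

    a = [a_0, ..., a_n], c(j) returns c_j, v = [v_0, ..., v_{n-1}] as in (2).
    """
    n = len(a) - 1
    if a[0] == 0 or a[n] == 0:
        raise ValueError("an nth order equation requires a_n * a_0 != 0")
    if len(v) != n:
        raise ValueError("exactly n initial values are required")
    u = list(v)
    for j in range(j0, j0 + count - n):
        known = sum(a[s] * u[j - j0 + s] for s in range(n))
        u.append((c(j + n) - known) / a[n])          # recursion (3)
    return u

# Example: u_{j+2} - u_{j+1} - u_j = 0 with u_0 = 0, u_1 = 1 (Fibonacci numbers).
fib = solve_difference([-1, -1, 1], lambda j: 0, [0, 1], 0, 10)
```

Because every new term is determined uniquely by the preceding $n$ terms, the same loop also illustrates the uniqueness half of the theorem.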


We consider the $n$th order homogeneous difference equations
corresponding to (1), namely:

$$L(u_j) = 0; \qquad j = j_0, j_0 + 1, \ldots. \tag{4}$$

If the sequences $\{u_j\}$ and $\{v_j\}$ are solutions of (4) then, by the linearity of
these equations, the sequence $\{\alpha u_j + \beta v_j\}$ is also a solution. Here $\alpha$ and $\beta$
are arbitrary numbers. Thus we easily find that the set of all solutions of
(4) forms a linear vector space. A set of solutions, say $r$ of them

$$\{u_i^{(1)}\}, \{u_i^{(2)}\}, \ldots, \{u_i^{(r)}\},$$

are linearly independent if only the trivial combination of the $\{u_i^{(\nu)}\}$ vanishes
identically; that is, if

$$\alpha_1 u_i^{(1)} + \alpha_2 u_i^{(2)} + \cdots + \alpha_r u_i^{(r)} = 0, \qquad \text{for } i = j_0, j_0 + 1, \ldots,$$

implies that $\alpha_1 = \alpha_2 = \cdots = \alpha_r = 0$. This is essentially the same notion
as the linear independence of vectors. A set of $n$ independent solutions of
the $n$th order equations (4) is called a fundamental set of solutions.

A basic result now can be stated as 


THEOREM 2. Let $\{u_i^{(\nu)}\}$, $\nu = 1, 2, \ldots, n$, be a fundamental set of solutions
of the homogeneous difference equations (4). Then any solution, $\{v_i\}$, of
these equations can be expressed uniquely in the form

$$v_i = \sum_{\nu=1}^{n} \alpha_\nu u_i^{(\nu)}, \qquad i = j_0, j_0 + 1, \ldots.$$

Proof. Since the set $\{u_i^{(\nu)}\}$ is independent, Theorem 1 implies that the
$n$ vectors $\mathbf{u}^{(\nu)}$, where $\mathbf{u}^{(\nu)} \equiv (u_i^{(\nu)})$ for $i = j_0, j_0 + 1, \ldots, j_0 + n - 1$, are
linearly independent. That is, according to Theorem 1, if the $\mathbf{u}^{(\nu)}$ were
linearly dependent, then the corresponding infinite sequences $\{u_i^{(\nu)}\}$ would
be linearly dependent. Hence the $n$th order matrix

$$A \equiv \begin{pmatrix}
u_{j_0}^{(1)} & u_{j_0}^{(2)} & \cdots & u_{j_0}^{(n)} \\
u_{j_0+1}^{(1)} & u_{j_0+1}^{(2)} & \cdots & u_{j_0+1}^{(n)} \\
\vdots & \vdots & & \vdots \\
u_{j_0+n-1}^{(1)} & u_{j_0+n-1}^{(2)} & \cdots & u_{j_0+n-1}^{(n)}
\end{pmatrix}$$

is non-singular. Thus given any $n$ components, say, $\mathbf{v} \equiv (v_i)$ for $i = j_0,
j_0 + 1, \ldots, j_0 + n - 1$, we can uniquely solve the $n$th order system

$$A\boldsymbol{\alpha} = \mathbf{v}.$$

The components $\alpha_\nu$ of $\boldsymbol{\alpha}$ are the coefficients to be used in the theorem.
Since the first $n$ components of any solution $\{v_i\}$ can be expressed as a
linear combination of the first $n$ components of the fundamental set the
theorem now follows by an application of Theorem 1. ■

We can, furthermore, find a fundamental set of solutions of (4). We try
as a solution the powers of some scalar, say

$$u_j = \alpha x^j; \qquad j = j_0, j_0 + 1, \ldots.$$

Then (4) yields

$$(a_n x^n + a_{n-1} x^{n-1} + \cdots + a_0)(\alpha x^j) = 0.$$

If $\alpha x^j = 0$ the corresponding solution is trivial and does not lead to a
fundamental set. Hence, we only consider the roots of

$$p_n(x) \equiv a_n x^n + a_{n-1} x^{n-1} + \cdots + a_0 = 0. \tag{5}$$

The $n$th degree polynomial $p_n(x)$ is called the characteristic polynomial
of the difference equation (4). We easily find that if $x$ is a root of (5) then
$\{u_j\} = \{x^j\}$ is a solution of the homogeneous difference equations. If the
roots of the characteristic equation are distinct, say $x_1, x_2, \ldots, x_n$, then a
fundamental set of solutions is given by $\{u_j^{(\nu)}\} = \{x_\nu^j\}$, $\nu = 1, 2, \ldots, n$.
Since $a_n a_0 \ne 0$, there is no zero root and the independence of the $\{u_j^{(\nu)}\}$
follows from the independence of the first $n$ components. That is, let us
define the matrix

$$U \equiv (\mathbf{u}^{(1)}, \mathbf{u}^{(2)}, \ldots, \mathbf{u}^{(n)}),$$

whose columns are vectors obtained from the first $n$ components of the
set of solutions defined above. Then clearly, $U$ is non-singular if the $x_i$
are distinct, since

$$U = \begin{pmatrix}
1 & 1 & \cdots & 1 \\
x_1 & x_2 & \cdots & x_n \\
\vdots & \vdots & & \vdots \\
x_1^{n-1} & x_2^{n-1} & \cdots & x_n^{n-1}
\end{pmatrix}
\begin{pmatrix}
x_1^{j_0} & & & 0 \\
& x_2^{j_0} & & \\
& & \ddots & \\
0 & & & x_n^{j_0}
\end{pmatrix}.$$
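For a second order equation with distinct characteristic roots, the representation of a solution in terms of the fundamental set $\{x_1^j\}$, $\{x_2^j\}$ can be carried out by hand or in a few lines of code. The sketch below is an illustration under assumed data: it takes the equation $u_{j+2} - u_{j+1} - u_j = 0$ with $j_0 = 0$, $v_0 = 0$, $v_1 = 1$, and the helper name is hypothetical.

```python
import math

def coefficients_2x2(x1, x2, v0, v1):
    """Solve alpha1 + alpha2 = v0, alpha1*x1 + alpha2*x2 = v1 by Cramer's rule."""
    det = x2 - x1                      # Vandermonde determinant, nonzero if x1 != x2
    return (v0 * x2 - v1) / det, (v1 - v0 * x1) / det

# Roots of the characteristic polynomial p_2(x) = x^2 - x - 1.
x1 = (1.0 + math.sqrt(5.0)) / 2.0
x2 = (1.0 - math.sqrt(5.0)) / 2.0
alpha1, alpha2 = coefficients_2x2(x1, x2, 0.0, 1.0)

# The combination alpha1 x1^j + alpha2 x2^j reproduces the solution with u_0 = 0, u_1 = 1.
values = [round(alpha1 * x1**j + alpha2 * x2**j) for j in range(15)]
```

Here the matrix of first components is the $2 \times 2$ Vandermonde matrix, so the coefficients exist and are unique precisely because $x_1 \ne x_2$.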


If the roots $x_i$ are not distinct, we can still define a fundamental set of
solutions. Let $x_1$ be a root of multiplicity $m_1 > 1$ of $p_n(x) = 0$. Then we
use the powers to generate one solution and successive derivatives† with
respect to $x_1$, up to order $m_1 - 1$, to generate $m_1 - 1$ additional solutions.
Specifically, let $u_j^{(1)} = x_1^j$. Now try

$$\{u_j^{(2)}\} = \frac{d}{dx_1}\{u_j^{(1)}\}, \quad \ldots, \quad \{u_j^{(m_1)}\} = \frac{d}{dx_1}\{u_j^{(m_1 - 1)}\}.$$

However, since any solution can be multiplied by a non-zero constant we
multiply the resulting $\{u_j^{(\nu)}\}$ by $x_1^{\nu - 1}$, to retain the original powers of $x_1$
in corresponding terms, that is, we introduce

$$\{v_j^{(\nu)}\} \equiv x_1^{\nu - 1}\{u_j^{(\nu)}\}, \qquad \nu = 1, 2, \ldots, m_1.$$

The elements of these sequences are found to be

$$\begin{aligned}
v_j^{(1)} &= x_1^j, \\
v_j^{(2)} &= j x_1^j, \\
v_j^{(3)} &= j(j - 1) x_1^j, \\
&\ \ \vdots \\
v_j^{(m_1)} &= j(j - 1) \cdots (j - m_1 + 2)\, x_1^j, \qquad \text{for } j = j_0, j_0 + 1, \ldots.
\end{aligned}$$

We leave the verification that these form $m_1$ solutions as Problem 1.

By forming linear combinations of the solutions {i>y v) } corresponding 
to a root, x lt of multiplicity m x , we find the simpler sequences {w , $’’ ) } 

<’ = V, 

wf> = yV, 

(6) <'=jV, 

W< m i> = j m i ~ 1 x 1 i , j = y'o, y'o + 1,- 

t To motivate this procedure, observe that if Xi and x 2 = Xi + h are two roots of 
(5), then 

Ui = h~ l [(x i + A)‘ - Xi 1 ], / = y'o,y'o + 1,..., 

is a solution of (4). But then 

lim u t = ix i" 1 


is also a solution. 
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These sequences are obviously linearly independent and are solutions 
because they can be obtained as linear combinations of the {v $ v) }. 
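
The claim that a repeated root contributes the extra solution $j x_1^j$ can be checked numerically. The sketch below is only an illustration under assumed data: the coefficients are chosen (they do not come from the text) so that the characteristic polynomial is $(x - 2)^2 (x - 3)$, and the helper names `L`, `w1`, `w2`, `w3`, `bogus` are hypothetical.

```python
# Illustrative coefficients (an assumption, not from the text):
# p(x) = (x - 2)^2 (x - 3) = x^3 - 7x^2 + 16x - 12,
# so x_1 = 2 is a double root and x_2 = 3 is a simple root.
a = [-12.0, 16.0, -7.0, 1.0]   # a[s] multiplies u_{j+s}


def L(u, j):
    """Apply L(u_j) = a_3 u_{j+3} + a_2 u_{j+2} + a_1 u_{j+1} + a_0 u_j."""
    return sum(a[s] * u(j + s) for s in range(len(a)))


def w1(j):
    return 2.0 ** j          # x_1^j


def w2(j):
    return j * 2.0 ** j      # j x_1^j, supplied by the double root


def w3(j):
    return 3.0 ** j          # x_2^j


def bogus(j):
    return j * 3.0 ** j      # not a solution: x_2 = 3 is only a simple root


# Largest residual |L(u_j)| over the first few indices, per candidate sequence.
residual = {f.__name__: max(abs(L(f, j)) for j in range(12))
            for f in (w1, w2, w3, bogus)}
```

The residuals of `w1`, `w2` and `w3` vanish, while `bogus` leaves a non-zero residual, in line with the rule that a root of multiplicity $m$ contributes exactly the sequences $j^k x^j$ with $k < m$.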

Let us now consider the inhomogeneous $n$th order linear difference
equations (1). If $\{v_j\}$ is any particular solution of equation (1) and $\{u_j\}$
is a solution of the homogeneous system (4), then $\{u_j + v_j\}$ is a solution
of (1). This solution of (1) can be made to satisfy any particular initial
conditions by adjusting the $\{u_j\}$. We now develop a discrete analog of
what is known as Duhamel's principle in the theory of differential equations
(where integral representations of solutions are obtained).

THEOREM 3. Let $\{u_j^{(\nu)}\}$ be the fundamental set of solutions of the $n$th order
homogeneous difference equation (4) which satisfy the initial conditions

(7) $u_i^{(\nu)} = \delta_{i\nu}, \quad i = 0, 1, \ldots, n - 1; \quad \nu = 0, 1, \ldots, n - 1.$

Then the solution of (1) subject to the initial conditions (2) with $j_0 = 0$,
is given by

(8) $u_j = \sum_{\nu=0}^{n-1} u_\nu u_j^{(\nu)} + \frac{1}{a_n} \sum_{k=0}^{j-n} c_{k+n} u_{j-k-1}^{(n-1)}, \quad j = 0, 1, \ldots.$

(Here we define $u_i^{(n-1)} \equiv 0$ for all $i < 0$ and $c_j \equiv 0$ for all $j < n$.)


Proof. The first sum in (8) satisfies the initial conditions (2) and the
homogeneous difference equations. Thus we need only show that the second
sum in (8) satisfies homogeneous initial conditions and the inhomogeneous
difference equations (1), with $j_0 = 0$. Let us define

$w_j \equiv \frac{1}{a_n} \sum_{k=0}^{j-n} c_{k+n} u_{j-k-1}^{(n-1)}, \quad j = 0, 1, \ldots.$

Then we have $w_j = 0$ for $j = 0, 1, \ldots, n - 1$ by recalling that $u_i^{(n-1)} = 0$
for $i \le n - 2$ and $c_j = 0$ for $j < n$. In fact, for the same reason, we may
write

$w_j = \frac{1}{a_n} \sum_{k=-\infty}^{\infty} c_{k+n} u_{j-k-1}^{(n-1)},$

since the additional terms vanish. Then

$L(w_j) = \sum_{s=0}^{n} a_s w_{j+s} = \frac{1}{a_n} \sum_{s=0}^{n} a_s \sum_{k=-\infty}^{\infty} c_{k+n} u_{j+s-k-1}^{(n-1)}$

$\phantom{L(w_j)} = \frac{1}{a_n} \sum_{s=0}^{n} a_s \sum_{k=0}^{j} c_{k+n} u_{j+s-k-1}^{(n-1)},$

since the terms corresponding to other values of $k$ vanish. Hence,

$L(w_j) = \frac{1}{a_n} \sum_{k=0}^{j} c_{k+n} L(u_{j-k-1}^{(n-1)}).$

However, it is easily verified that

$L(u_{j-k-1}^{(n-1)}) = a_n \delta_{jk}$

and so finally,

$L(w_j) = c_{j+n}.$
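
As a numerical cross-check of the representation (8), the following sketch compares it with direct forward marching of the recurrence. It is an illustration only: the function names and the sample coefficients, initial values and right-hand side are assumptions, not taken from the text.

```python
def solve_direct(a, init, c, J):
    """March a_n u_{j+n} + ... + a_0 u_j = c(j+n) forward from u_0..u_{n-1}.

    Returns the list [u_0, ..., u_J] (requires J >= n - 1).
    """
    n = len(a) - 1
    u = list(init)
    for j in range(J - n + 1):
        rhs = c(j + n) - sum(a[s] * u[j + s] for s in range(n))
        u.append(rhs / a[n])
    return u


def fundamental(a, nu, J):
    """Homogeneous solution with u_i = delta_{i,nu} for i = 0..n-1, as in (7)."""
    n = len(a) - 1
    init = [1.0 if i == nu else 0.0 for i in range(n)]
    return solve_direct(a, init, lambda j: 0.0, J)


def solve_duhamel(a, init, c, J):
    """Evaluate formula (8): homogeneous part plus the discrete Duhamel sum."""
    n = len(a) - 1
    fund = [fundamental(a, nu, J) for nu in range(n)]
    last = fund[n - 1]                      # the sequence u^{(n-1)}
    u = []
    for j in range(J + 1):
        hom = sum(init[nu] * fund[nu][j] for nu in range(n))
        # k runs over 0..j-n; the range is empty when j < n
        part = sum(c(k + n) * last[j - k - 1] for k in range(j - n + 1)) / a[n]
        u.append(hom + part)
    return u


# Sample data (assumed): u_{j+2} - 1.5 u_{j+1} + 0.5 u_j = j + 2
a = [0.5, -1.5, 1.0]
init = [1.0, -2.0]
c = lambda j: float(j)
direct = solve_direct(a, init, c, 15)
duhamel = solve_duhamel(a, init, c, 15)
max_diff = max(abs(p - q) for p, q in zip(direct, duhamel))
```

Both routes should agree to rounding error; the Duhamel form costs more arithmetic but isolates the effect of the forcing terms $c_{k+n}$, which is what the convergence proof in the next section exploits.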
PROBLEMS, SECTION 4

1. Verify that equation (6) defines $m_1$ linearly independent solutions of (4).

2. Verify that a fundamental set of solutions of (4) is obtained from (6),
if $m_1$ is replaced by $m_i$ and $x_1$ by $x_i$ for each root $x_i$ of (5) of multiplicity $m_i$.


5. CONSISTENCY, CONVERGENCE, AND STABILITY OF DIFFERENCE METHODS

The numerical procedures which we have introduced in Sections 1 
through 3 in order to approximate the solutions of differential equations 
may be called difference methods. In this section we study the convergence 
of a more general class of difference schemes. The analysis constitutes a 
uniform development for all of the commonly used methods that were 
treated separately in Sections 1 through 3. 

The solution of the difference equations is what we try to compute, and 
this may have to be done for very fine meshes, i.e., for many net points. 
Thus, as a practical matter, it is important that these solutions should not 
be too sensitive to small errors in the computations (for example, roundoff 
errors). This sensitivity to errors is related to what is called the stability 
of the difference equations. We have already investigated such matters but 
without the introduction of this terminology. We shall see that for
consistent methods, stability of the difference equations is equivalent to
convergence of the difference equation solution to the solution of the 
differential equation problem. 

As usual, we consider methods for approximating the solution, $y(x)$, of
the initial value problem

(1) $y' = f(x, y), \quad a \le x \le b; \qquad y(a) = y_0.$

We assume that $f(x, y)$ is in the class $\mathcal{F}$ of functions such that $f_y(x, y)$,
$f_x(x, y)$, and all partial derivatives of $f$ of some finite order $q \ge 1$ are
continuous and uniformly bounded in $S: \{(x, y) \mid a \le x \le b;\ |y| < \infty\}$.
For any fixed net spacing $h = (b - a)/N$, we use a uniform net $x_j = a + jh$,
$j = 0, 1, \ldots, N$, and seek approximations $u_j$ to $y(x_j)$ on this net.
The approximations are defined as the solution of some difference problem,
say

(2a) $a_n u_{j+n} + \cdots + a_0 u_j = hF\{h, x_j; u_{j-m}, \ldots, u_{j+n}; f\} + h\rho_{j+n},$

$j = m, m + 1, \ldots, N - n;$

where $\{a_i\}$ are real constants independent of $h$ satisfying $a_n a_0 \neq 0$, and
$\rho_{j+n}$ is the local rounding error subject to

$|\rho_{j+n}| \le \rho(h), \quad$ for $j \ge m$.

The initial data are specified as, say

(2b) $u_0 = y_0 + \rho_0,\ u_1 = y_1 + \rho_1,\ \ldots,\ u_{m+n-1} = y_{m+n-1} + \rho_{m+n-1},$

where

$|\rho_k| \le r(h), \quad$ for $0 \le k \le m + n - 1$.

We shall later require that $\rho(h) \to 0$ and $r(h) \to 0$ as $h \to 0$.

By suitably defining $F\{h, x_j; u_{j-m}, \ldots, u_{j+n}; f\}$, we may incorporate in
(2) all of the schemes treated in the previous sections. On the other hand,
the only properties that we need postulate for $F$, in order to make this
general study of convergence, are easily seen to hold for all of the commonly
used difference methods. That is, we require

(3a) $F\{h, x_j; u_{j-m}, \ldots, u_{j+n}; 0\} \equiv 0;$

(3b) $|F\{h, x_j; v_{j-m}, \ldots, v_{j+n}; f\} - F\{h, x_j; u_{j-m}, \ldots, u_{j+n}; f\}| \le C \sum_{k=-m}^{n} |v_{j+k} - u_{j+k}|;$

where the constant $C$ depends only on the bounds of $f$ and a finite number
of its partial derivatives in $S$. The local truncation error, $\tau_{j+n}$, is defined by

(4a) $\sum_{k=0}^{n} a_k y_{j+k} - hF\{h, x_j; y_{j-m}, \ldots, y_{j+n}; f\} \equiv h\tau_{j+n},$

where $y(x)$ is a solution of (1). We further require that

(4b) $|\tau_{j+n}| \le \tau(h) \quad$ for $m \le j \le N - n$

and

(4c) $\lim_{h \to 0} \tau(h) = 0,$

i.e., the truncation error tends to zero. Condition (4c) implies that the
difference equation (2) is an “approximation” to (1), rather than some
other equation. Strictly speaking, we say that (2) is consistent with (1)
if $\tau(h) \to 0$ and $r(h) \to 0$ as $h \to 0$. [For example, let $m = 0$, $n = 1$;
$a_0 = -1$; $a_1 = 1$ and

$F\{h, x_j; u_{j-m}, \ldots, u_{j+n}; f\} \equiv f(x_{j+1}, u_{j+1}) + f(x_j, u_j).$

Then (2) is not consistent with the equation (1). That is, by using Taylor
series in (4a), for small $h$, $\tau_{j+1} \approx -f(x_j, y_j)$. Hence $\tau_j$ does not approach
zero. In fact, this scheme is consistent with the equation $y' = 2f(x, y)$.]
If, for all $h \le h_0$,

(5) $|\tau_{j+n}| \le \tau(h) \equiv Mh^p, \quad m \le j \le N - n,$

where $M$ depends only on the bounds of $f$ and a finite number of its
derivatives in $S$, we say that the truncation error of the difference method
is of order $p$.
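
The bracketed example of an inconsistent scheme is easy to observe numerically. The sketch below is an illustration under assumptions not in the text: it takes the test problem $y' = y$, $y(0) = 1$ on $[0, 1]$, for which the implicit relation can be solved in closed form at each step, and the function name is hypothetical.

```python
import math


def run_inconsistent(N, lam=1.0, y0=1.0, a=0.0, b=1.0):
    """Apply u_{j+1} - u_j = h [f(x_{j+1}, u_{j+1}) + f(x_j, u_j)] to f = lam * y.

    For this linear f the implicit equation is solved exactly for u_{j+1}.
    """
    h = (b - a) / N
    u = y0
    for _ in range(N):
        # u_new - u = h * lam * (u_new + u)  =>  u_new = u (1 + h lam) / (1 - h lam)
        u = u * (1.0 + h * lam) / (1.0 - h * lam)
    return u


u_end = run_inconsistent(1000)
gap_to_true_solution = abs(u_end - math.exp(1.0))    # solution of y' = y at x = 1
gap_to_doubled_rhs = abs(u_end - math.exp(2.0))      # solution of y' = 2y at x = 1
```

As $h$ decreases the computed value settles near $e^2$ rather than $e$: the scheme converges, but to the solution of $y' = 2f(x, y)$, which is what the failure of (4c) predicts.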

We, of course, are interested in characterizing the convergent schemes
and in obtaining an estimate of the error. For a fixed mesh width, $h$, we
define the pointwise error

$e_j \equiv u_j - y_j, \quad$ where $y_j = y(x_j)$.

The method (2) is convergent if, for any $f(x, y)$ in $\mathcal{F}$, $\max_{0 \le j \le N} |e_j| \to 0$ as
$h \to 0$, provided that the rounding errors $\rho(h)$ and $r(h)$ tend to zero.

If scheme (2) is convergent for all $f$ in $\mathcal{F}$, then it is convergent for the
problem (1) with $f(x, y) \equiv 0$, $y_0 = 0$. From this simple observation, we
note that if (2) is convergent and $F$ satisfies (3a), then the solution $\{u_j\}$ of

(6) $a_n u_{j+n} + a_{n-1} u_{j+n-1} + \cdots + a_0 u_j = 0; \qquad u_0 = \rho_0,\ u_1 = \rho_1,\ \ldots,\ u_{n-1} = \rho_{n-1},$

must tend to the solution $y(x) \equiv 0$, for any set of initial errors $\{\rho_k\}$ such that
$\max_k |\rho_k| \to 0$ as $h \to 0$.

k 

We say that the difference method (2) satisfies the root condition if

(7a) $P(\zeta) \equiv a_n \zeta^n + a_{n-1} \zeta^{n-1} + \cdots + a_1 \zeta + a_0$

has only zeros $\zeta_i$ such that

(7b) $|\zeta_i| \le 1,$

and the multiplicities, $r_i$, of the $\zeta_i$ are such that if

(7c) $|\zeta_k| = 1$, then $r_k = 1$.

In other words, scheme (2) satisfies the root condition, if the zeros of
$P(\zeta)$ lie in the unit circle and only simple zeros may lie on the boundary
of the unit circle. We can now establish a necessary condition that (2)
be convergent; i.e., for the solution of (6) to tend to zero.
THEOREM 1. If (2) is convergent and $F$ satisfies (3a), then (2) satisfies the
root condition (7).

Proof. We show by contradiction that the root condition is necessary.
That is, if $|\zeta_i| > 1$ and $\zeta_i$ is a complex root of $P(\zeta)$, define

(8a) $u_j = h(\zeta_i^j + \bar{\zeta}_i^j), \quad j = 0, 1, \ldots;$

if $\zeta_i$ is a real root set

(8b) $u_j = h\zeta_i^j.$

Clearly, (8) is a solution of (6) with $\rho_k = u_k$ for $k = 0, 1, \ldots, n - 1$,
and $\max |\rho_k| \to 0$ as $h \to 0$. On the other hand, for any $c$ in $a < c < b$
set $j = [c/h]$. But then $|u_{[c/h]}| \to \infty$ as $h \to 0$. Hence such a scheme is not
convergent.

If, on the other hand, $|\zeta_i| = 1$ and $\zeta_i$ is a multiple root and complex, set

(9a) $u_j = hj(\zeta_i^j + \bar{\zeta}_i^j), \quad j = 0, 1, \ldots;$

while if $\zeta_i$ is real, set

(9b) $u_j = hj\zeta_i^j.$

Now if $j = [c/h]$, $|u_{[c/h]}|$ does not approach zero as $h \to 0$. Hence such
a scheme is not convergent. ■
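
The mechanism of this proof can be watched on the homogeneous problem (6) for $n = 2$. The sketch below is illustrative only: the three coefficient sets, the choice of initial errors $\rho_0 = h$, $\rho_1 = -h$, and the function names are assumptions made for the demonstration.

```python
import cmath


def roots_quadratic(a2, a1, a0):
    """Zeros of P(zeta) = a2 zeta^2 + a1 zeta + a0."""
    d = cmath.sqrt(a1 * a1 - 4.0 * a2 * a0)
    return (-a1 + d) / (2.0 * a2), (-a1 - d) / (2.0 * a2)


def satisfies_root_condition(a2, a1, a0, tol=1e-12):
    """Check (7b) and (7c) for a quadratic characteristic polynomial."""
    z1, z2 = roots_quadratic(a2, a1, a0)
    if abs(z1) > 1.0 + tol or abs(z2) > 1.0 + tol:
        return False                      # a zero outside the unit circle
    if abs(z1 - z2) <= tol and abs(abs(z1) - 1.0) <= tol:
        return False                      # a double zero on the unit circle
    return True


def max_solution_of_6(a2, a1, a0, N):
    """Solve (6) with h = 1/N, rho_0 = h, rho_1 = -h; return max |u_j|, j <= N."""
    h = 1.0 / N
    u = [h, -h]
    for j in range(N - 1):
        u.append(-(a1 * u[j + 1] + a0 * u[j]) / a2)
    return max(abs(v) for v in u)


stable = (1.0, 0.0, -1.0)      # zeros +1 and -1, both simple
outside = (1.0, 4.0, -5.0)     # zeros +1 and -5
double = (1.0, -2.0, 1.0)      # double zero at +1
growth = {name: max_solution_of_6(*coef, 100)
          for name, coef in (("stable", stable), ("outside", outside), ("double", double))}
```

With the simple unit-modulus zeros the solution stays of size $h$; with a zero outside the unit circle it grows without bound as $h \to 0$; with the double zero on the circle it stays of order one, so in neither failing case does it tend to zero.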

The requirement that the solution $\{u_j\}$ of (2) depend Lipschitz continuously
on $\{\rho_k\}$ is the definition of stability. That is, we say that (2) is stable
if for any $f$ in $\mathcal{F}$, there is an $h_0$ and an $M$, such that for all $0 < h \le h_0$,
and $N = N(h) = (b - a)/h$,

(10a) $|u_j - v_j| \le M\epsilon, \quad$ for $0 \le j \le N$,

whenever $\{v_j\}$ satisfies

(10b) $a_n v_{j+n} + \cdots + a_0 v_j = hF\{h, x_j; v_{j-m}, \ldots, v_{j+n}; f\} + h\sigma_{j+n}, \quad j \ge m;$

(10c) $v_0 = y_0 + \sigma_0,\ \ldots,\ v_{m+n-1} = y_{m+n-1} + \sigma_{m+n-1},$

where

$|\rho_k - \sigma_k| \le \epsilon \quad$ for $0 \le k \le N$.


It is then easy to show 

THEOREM 2. If the scheme (2) is stable and $F$ satisfies (3a) then the root
condition (7) is satisfied.

Proof. The proof follows by contradiction as did the previous theorem.
Merely verify that in the case $f \equiv 0$ and $\rho_{j+n} = 0$ for $j \ge 0$, the definitions
(8) or (9) with $h$ replaced by $\delta$ define a solution $\{u_j\}$ of (2). On the other
hand, set $\sigma_k = 0$ and $v_j \equiv 0$. Then it follows that

$|\rho_k - \sigma_k| \le \epsilon \quad$ for $0 \le k \le n - 1$,

where $\epsilon$ is proportional to $\delta$. But now, (10a) cannot be satisfied for any
fixed $M$ as $h \to 0$. ■

Next we have 

THEOREM 3. If (2) is consistent with (1), i.e. satisfies (4) and $r(h) \to 0$
as $h \to 0$, and $F$ satisfies (3), then (2) is a convergent scheme if and only if
the root condition (7) is satisfied.

Proof. In Theorem 1 we have shown that the root condition is necessary
for convergence. We now assume that the root condition holds and
prove that (2) is convergent. By subtracting equation (4a) from equation
(2a), we obtain a difference equation satisfied by the pointwise error
$e_j \equiv u_j - y_j$,

(11) $\sum_{s=0}^{n} a_s e_{j+s} = c_{j+n}, \quad$ for $m \le j \le N - n$,

where

$c_{j+n} = h[F\{h, x_j; u_{j-m}, \ldots, u_{j+n}; f\} - F\{h, x_j; y_{j-m}, \ldots, y_{j+n}; f\}] + h\rho_{j+n} - h\tau_{j+n}.$

We may solve the inhomogeneous difference equation (11), by using
Theorem 4.3, in the form

(12) $e_{j+m} = \sum_{k=0}^{n-1} e_{m+k} u_j^{(k)} + \frac{1}{a_n} \sum_{k=0}^{j-n} c_{k+m+n} u_{j-k-1}^{(n-1)}, \quad$ for $j = 0, 1, \ldots, N - m$.

[That is, define

$E_j \equiv e_{j+m}, \quad C_j \equiv c_{j+m} \quad$ for $j = 0, 1, \ldots.$

Then (11) holds, i.e.,

$\sum_{s=0}^{n} a_s E_{j+s} = C_{j+n} \quad$ for $0 \le j \le N - m - n$.

Hence equation (4.8) gives a representation for $E_j$, which reduces to (12).]
But because the root condition is satisfied, we know from Theorem 4.2
and equation (4.6) that the solutions $\{u_j^{(k)}\}$ satisfy

(13) $|u_j^{(k)}| \le Q$

for some constant $Q$ independent of $k$ and $j$.
Furthermore, from the definitions of $c_{j+n}$, $\rho(h)$, $\tau(h)$ that appear after
equations (11), (2a), and (4a) respectively, and from (3b), we find

(14) $|c_{k+m+n}| \le h\left\{C \sum_{r=0}^{m+n} |e_{k+r}| + \rho(h) + \tau(h)\right\}.$

If we use the estimates (13) and (14) in (12), we have

$|e_{j+m}| \le Qn \max_{0 \le k \le n-1} |e_{m+k}| + \frac{Q}{|a_n|}(j - n + 1)h\left[(m + n + 1)C \max_{0 \le r \le j+m} |e_r| + \rho(h) + \tau(h)\right].$

This inequality simplifies, if we introduce

$\omega_j \equiv \max_{0 \le k \le j} |e_k|,$

to read

(15) $|e_{j+m}| \le Qn\omega_{m+n-1} + \frac{Q(j - n + 1)h}{|a_n|} \times [(m + n + 1)C\omega_{j+m} + \rho(h) + \tau(h)],$

for $j = 0, 1, \ldots, N - m$.

Since $\omega_{j+m}$ is equal to $|e_k|$ for some index $k \le j + m$ and since $n \ge 1$,
we find from (15) that a fortiori

(16) $\omega_{j+m} \le \kappa jh\,\omega_{j+m} + Qn\omega_{m+n-1} + \frac{Qjh}{|a_n|}[\rho(h) + \tau(h)],$

where

$\kappa \equiv \frac{QC(m + n + 1)}{|a_n|}, \quad$ for $j = 0, 1, \ldots, N - m$.

If we limit the range of $j$, as $h$ tends to zero, so that

(17) $jh\kappa \le \tfrac{1}{2},$

then (16) yields

(18) $\omega_{j+m} \le 2Qn\omega_{m+n-1} + \frac{Q}{|a_n|\kappa}[\rho(h) + \tau(h)].$

If we now employ the definition of $\omega_j$, (18) yields, using $r(h)$ defined
after (2b),

(19) $|e_{j+m}| \le 2Q\left[n r(h) + \frac{\rho(h) + \tau(h)}{2\kappa|a_n|}\right] \quad$ for $0 \le jh \le \frac{1}{2\kappa}$.

Equation (19) bounds the pointwise error, for a finite interval
$[a, a + 1/(2\kappa)]$, in terms of the bound for the initial error, $\omega_{m+n-1}$, and
the bounds $\rho(h)$ and $\tau(h)$ of the rounding and truncation errors. The length
of the interval of convergence, $1/(2\kappa)$, is independent of $h$ and is defined
after (16).

Hence we may repeat this argument by beginning with the $m + n$
errors bounded by (19),

$e_{[1/(2\kappa h)] - m - n + 1},\ e_{[1/(2\kappa h)] - m - n + 2},\ \ldots,\ e_{[1/(2\kappa h)]}.$

In this way, we may successively establish pointwise convergence as
$h \to 0$, in the finite number of intervals

$\left[a + \frac{\nu}{2\kappa},\ a + \frac{\nu + 1}{2\kappa}\right], \quad \nu = 0, 1, \ldots, R,$

where $R \equiv [2\kappa(b - a)]$. The error estimates for successive intervals
can then be seen to satisfy, in analogy with (19),

(20) $I_{p+1} \le 2Q\left[nI_p + \frac{\rho(h) + \tau(h)}{2\kappa|a_n|}\right] \quad$ for $p = 1, 2, \ldots, R$,

where $I_p$ is the pointwise error bound for the interval

$\left[a + \frac{p - 1}{2\kappa},\ a + \frac{p}{2\kappa}\right].$

From (20) it is then possible to recursively bound $I_{R+1}$ and hence to
bound $|e_j|$ for $0 \le j \le N$ by

(21) $|e_j| \le \begin{cases} (2Qn)^{R+1} r(h) + \dfrac{(2Qn)^{R+1} - 1}{2Qn - 1} \cdot \dfrac{Q(\rho(h) + \tau(h))}{\kappa|a_n|}, & \text{if } 2Qn \neq 1; \\[2ex] r(h) + \dfrac{(R + 1)Q(\rho(h) + \tau(h))}{\kappa|a_n|}, & \text{if } 2Qn = 1. \end{cases}$

Formula (21) not only establishes convergence of the finite difference
scheme (2), but gives an upper bound for the error in terms of the initial
error, the rounding error and the truncation error. This bound is of the
same general character as were the bounds that we derived earlier for the
special methods treated in Sections 1 through 3. ■

By essentially the same arguments we could prove: 

THEOREM 4. If the $F$ in (2) satisfies (3) then (2) is a stable scheme iff the
root condition (7) is satisfied.

We have therefore established the important consequence of Theorems
3 and 4:

THEOREM 5. If the scheme (2) is consistent with (1) and $F$ satisfies condition
(3), then the necessary and sufficient condition that (2) be convergent is
that it be stable. ■

It is possible to strengthen Theorem 3 by noting that $F$ need only
satisfy the Lipschitz condition (3b) in a narrow strip about the solution
$y(x)$ given by $S_d: \{(x, y) \mid a \le x \le b;\ |y - y(x)| \le d\}$ for any fixed
constant $d > 0$. That is, for $h$ sufficiently small, the error estimate (21)
shows that if the solution of the difference equation starts in the strip
$S_{d/2}$ then it remains in the strip $S_d$.

The special case with $m = 0$ and

(22) $F\{h, x_j; u_{j-m}, \ldots, u_{j+n}; f\} \equiv \sum_{s=0}^{n} b_s f(x_{j+s}, u_{j+s})$

has been treated by Dahlquist. He found the surprising result that although
by proper choice of the $2n + 1$ independent parameters $\{a_s/a_n\}$, $\{b_s/a_n\}$, it is
possible to construct a scheme having a truncation error of order $2n$,
only schemes with a truncation error of order at most $n + 2$ may be
convergent. (In fact, if $n$ is odd then only schemes for which the truncation
error is of order at most $n + 1$ may be convergent.) The implicit scheme of
equation (2.3a) with $p = n - 1$, based on the Newton-Cotes quadrature
formulae applied to (2.2) with $p = n - 1$, then has the maximum possible
order of truncation error for convergent schemes of form (2) with $F$
given by (22). Dahlquist's work finds other schemes having a truncation
error of the same order, but shows that schemes which are both convergent
and of greater accuracy do not exist.
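
The tension between high order and convergence can be seen on two explicit two-step schemes of the form (2) with $F$ given by (22). The sketch below is an illustration under assumptions not in the text: the test problem $y' = -y$, $y(0) = 1$ on $[0, 1]$, exact starting values, and the two standard schemes chosen here, namely the midpoint rule $u_{j+2} - u_j = 2hf_{j+1}$ (order 2, zeros $\pm 1$) and $u_{j+2} + 4u_{j+1} - 5u_j = h(4f_{j+1} + 2f_j)$ (order 3, zeros $1$ and $-5$).

```python
import math


def two_step_error(alpha, beta, N):
    """Max error on [0, 1] of a2 u_{j+2} + a1 u_{j+1} + a0 u_j = h (b1 f_{j+1} + b0 f_j).

    alpha = (a0, a1, a2), beta = (b0, b1); test problem y' = -y, y(0) = 1.
    """
    a0, a1, a2 = alpha
    b0, b1 = beta
    f = lambda x, y: -y
    exact = lambda x: math.exp(-x)
    h = 1.0 / N
    u = [exact(0.0), exact(h)]            # exact starting values (an assumption)
    for j in range(N - 1):
        rhs = h * (b1 * f((j + 1) * h, u[j + 1]) + b0 * f(j * h, u[j]))
        u.append((rhs - a1 * u[j + 1] - a0 * u[j]) / a2)
    return max(abs(u[j] - exact(j * h)) for j in range(N + 1))


midpoint = ((-1.0, 0.0, 1.0), (0.0, 2.0))     # satisfies the root condition
order3 = ((-5.0, 4.0, 1.0), (2.0, 4.0))       # violates it: zero at -5

err_mid_20 = two_step_error(*midpoint, 20)
err_mid_40 = two_step_error(*midpoint, 40)
err_o3_20 = two_step_error(*order3, 20)
err_o3_40 = two_step_error(*order3, 40)
```

Halving $h$ reduces the midpoint-rule error, while the formally more accurate third-order scheme gets worse under refinement because its parasitic zero $-5$ amplifies every local error.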


PROBLEMS, SECTION 5 

1. Define $F$ for the following schemes treated in Sections 1 through 3 given
by

(a) equation (1.1a) 

(b) equation (1.24a) 

(c) equation (1.37a) 

(d) equation (2.3a) 

(e) equation (2.3b) 

(f) equation (2.7a and b) 

(g) equation (3.11a and b) 

(h) equation (3.8) and (3.19a, b, and c) 

and verify that conditions (3a and b) are satisfied. Which of these schemes do 
not satisfy the root condition? 

2. If (2) is convergent, show that $P(1) = \sum_{s=0}^{n} a_s = 0$.

[Hint: Let $f(x, y) \equiv 0$; $y(a) = y_0 \neq 0$; $\rho_k \equiv 0$.]

3. In the scheme (2) with $F$ given by (22), show that with $T(\zeta) \equiv \sum_{s=0}^{n} b_s \zeta^s$,
$P'(1) = T(1)$ implies that the truncation error is of order $p \ge 1$.

[Hint: Expand the left side of equation (3c) about $(x_j, y_j)$ in powers of $h$.
Observe that

$P'(1) = \sum_{s=0}^{n} s a_s, \quad T(1) = \sum_{s=0}^{n} b_s.$]


6. HIGHER ORDER EQUATIONS AND SYSTEMS 


Any $r$th order ordinary differential equation,

$\frac{d^r z}{dx^r} = g\left(x, z, \frac{dz}{dx}, \frac{d^2 z}{dx^2}, \ldots, \frac{d^{r-1} z}{dx^{r-1}}\right),$

can be replaced by an equivalent system of first order equations. There are
a variety of ways in which this reduction can be performed; the most
straightforward introduces the variables

$y^{(1)}(x) \equiv z(x), \quad y^{(2)}(x) \equiv \frac{dy^{(1)}(x)}{dx}, \quad \ldots, \quad y^{(r)}(x) \equiv \frac{dy^{(r-1)}(x)}{dx}.$

Then the differential equation can be written as

$\frac{d}{dx} y^{(1)} = y^{(2)},$

$\frac{d}{dx} y^{(2)} = y^{(3)},$

$\vdots$

$\frac{d}{dx} y^{(r-1)} = y^{(r)},$

$\frac{d}{dx} y^{(r)} = g(x, y^{(1)}, y^{(2)}, \ldots, y^{(r)}).$

This is, of course, a special case of the general system

(1a) $\frac{d\mathbf{y}}{dx} = \mathbf{f}(x; \mathbf{y}).$

Here we have introduced the $r$-dimensional column vectors $\mathbf{y}$ and $\mathbf{f}$ with
components

$y^{(\nu)}(x), \quad f^{(\nu)}(x; \mathbf{y}) \equiv f^{(\nu)}(x, y^{(1)}, y^{(2)}, \ldots, y^{(r)}), \quad \nu = 1, 2, \ldots, r.$

We will study the difference methods appropriate for solving a system (1a).
The initial data for such a system are assumed given in the form

(1b) $\mathbf{y}(a) = \mathbf{y}_0,$

where we seek a solution of (1) in the interval $a \le x \le b$.
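
The reduction to a first order system is mechanical, as the following sketch suggests. It is only an illustration: the helper names are hypothetical, the integrator is the simplest one-step scheme (Euler's method, applied componentwise), and the test equation $z'' = -z$ with $z(0) = 0$, $z'(0) = 1$ is an assumed example.

```python
import math


def as_first_order_system(g):
    """Return f(x, y) for the system equivalent to z^(r) = g(x, z, z', ..., z^(r-1)).

    The state y is the list [y^(1), ..., y^(r)] = [z, z', ..., z^(r-1)].
    """
    def f(x, y):
        # (y^(1))' = y^(2), ..., (y^(r-1))' = y^(r), (y^(r))' = g(x, y^(1), ..., y^(r))
        return y[1:] + [g(x, *y)]
    return f


def euler_system(f, a, b, y0, N):
    """Euler's method u_{j+1} = u_j + h f(x_j; u_j) for a vector unknown."""
    h = (b - a) / N
    x, y = a, list(y0)
    for _ in range(N):
        fy = f(x, y)
        y = [yi + h * fi for yi, fi in zip(y, fy)]
        x += h
    return y


# Assumed example: z'' = -z, z(0) = 0, z'(0) = 1, whose solution is sin x.
f = as_first_order_system(lambda x, z, dz: -z)
z_end, dz_end = euler_system(f, 0.0, math.pi / 2, [0.0, 1.0], 20000)
```

At $x = \pi/2$ the two components approximate $\sin x = 1$ and $\cos x = 0$; any of the higher order methods of Sections 1 through 3 could be substituted for the Euler step without changing the reduction itself.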
All of the difference methods previously proposed for a single first
order equation have their direct analogs for the system (1). With (1a)
in component form,

$\frac{dy^{(\nu)}}{dx} = f^{(\nu)}(x, y^{(1)}(x), \ldots, y^{(r)}(x)), \quad \nu = 1, 2, \ldots, r;$

it does not require much insight to write down the corresponding difference
methods based on quadrature formulae or even the single-step methods.
In fact, the general predictor-corrector becomes, in vector form,

(2a) $\mathbf{u}^*_{i+1} = \mathbf{u}_{i-q} + h \sum_{j=0}^{m} \beta_j \mathbf{f}(x_{i-j}; \mathbf{u}_{i-j});$

(2b) $\mathbf{u}_{i+1} = \mathbf{u}_{i-p} + h \sum_{k=1}^{n} \alpha_k \mathbf{f}(x_{i+1-k}; \mathbf{u}_{i+1-k}) + h\alpha_0 \mathbf{f}(x_{i+1}; \mathbf{u}^*_{i+1}).$

Similarly, the general one-step difference methods for the system (1)
can be written as

(3) $\mathbf{u}_{j+1} = \mathbf{u}_j + h\mathbf{F}\{h, x_j; \mathbf{u}_j; \mathbf{f}\}.$

For the class of methods defined in Subsection 3.2 we take (for systems)

(4a) $h\mathbf{F}\{h, x_j; \mathbf{u}_j; \mathbf{f}\} \equiv h \sum_{\nu=0}^{n} \alpha_\nu \mathbf{f}(\xi_\nu; \boldsymbol{\eta}_\nu);$

(4b) $\xi_0 = x_j; \quad \xi_\nu = \xi_0 + \theta_\nu h;$

(4c) $\boldsymbol{\eta}_0 = \mathbf{u}_j; \quad \boldsymbol{\eta}_\nu = \boldsymbol{\eta}_0 + h \sum_{k=0}^{\nu-1} \alpha_{\nu k} \mathbf{f}(\xi_k; \boldsymbol{\eta}_k), \quad \nu = 1, 2, \ldots, n.$

Here the quantities $\alpha_\nu$, $\theta_\nu$, and $\alpha_{\nu k}$ are defined as in (3.20) and (3.21).

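As in Section 2 we define the truncation error in predictor-corrector
methods applied to a system. That is, an $r$-dimensional vector, $\boldsymbol{\tau}_{i+1}$,
defined by

(5a) $\mathbf{y}_{i+1} = \mathbf{y}_{i-p} + h \sum_{k=1}^{n} \alpha_k \mathbf{f}(x_{i+1-k}; \mathbf{y}_{i+1-k}) + h\alpha_0 \mathbf{f}(x_{i+1}; \mathbf{y}^*_{i+1}) + h\boldsymbol{\tau}_{i+1}.$

Here $\mathbf{y}^*_{i+1} = \mathbf{y}_{i+1} - h\boldsymbol{\sigma}^*_{i+1}$ defines $\boldsymbol{\sigma}^*_{i+1}$, and $\mathbf{y}^*_{i+1}$ is defined by the right
side of (2a) with $\mathbf{u}$ replaced by $\mathbf{y}$. We find that

(5b) $\boldsymbol{\tau} = \boldsymbol{\sigma} + h\alpha_0 J\boldsymbol{\sigma}^*,$

where $\boldsymbol{\sigma}^*$ and $\boldsymbol{\sigma}$ have components $\sigma^{(\nu)*}$ and $\sigma^{(\nu)}$ which are respectively the
errors in applying the appropriate quadrature formulae to the integration
of $f^{(\nu)}(x, y^{(1)}(x), \ldots, y^{(r)}(x))$. The elements of the matrix $J$ are found by
evaluating the corresponding elements of the Jacobian matrix

(5c) $J \equiv (a_{\nu\mu}) = \left(\frac{\partial f^{(\nu)}}{\partial y^{(\mu)}}\right)$

at appropriate intermediate points. The detailed derivation of (5) is left
as an exercise. By using the matrix (5c) we find that the error vectors

(6) $\mathbf{e}_j = \mathbf{u}_j - \mathbf{y}_j, \quad \mathbf{e}^*_j = \mathbf{u}^*_j - \mathbf{y}^*_j$

satisfy the systems

(7a) $\mathbf{e}^*_{i+1} = \mathbf{e}_{i-q} + h \sum_{j=0}^{m} \beta_j J_{i-j} \mathbf{e}_{i-j};$

(7b) $\mathbf{e}_{i+1} = \mathbf{e}_{i-p} + h \sum_{j=1}^{n} \alpha_j J_{i+1-j} \mathbf{e}_{i+1-j} + h\alpha_0 J_{i+1} \mathbf{e}_{i-q} + h^2 \alpha_0 J_{i+1} \sum_{j=0}^{m} \beta_j J_{i-j} \mathbf{e}_{i-j} - h\boldsymbol{\tau}_{i+1}.$

Here again the matrices $J_j$ have as elements the $a_{\nu\mu}$ of $J$ evaluated at
appropriate intermediate points (the elements in each row of $J_j$ can be shown to
be evaluated at the same point). A convergence proof can now be given
exactly as in Subsection 2.1 (see Theorem 2.1) if we employ appropriate
vector and matrix norms.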

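In fact, if the root condition (5.7) is satisfied, we may copy the proof
of convergence and the error estimates given in Theorem 5.3 by replacing

$y, f, u, v, e, F, \rho_k, \tau_k, c_k, E_k, C_k, |\cdot|$

by the corresponding vector quantities

$\mathbf{y}, \mathbf{f}, \mathbf{u}, \mathbf{v}, \mathbf{e}, \mathbf{F}, \boldsymbol{\rho}_k, \boldsymbol{\tau}_k, \mathbf{c}_k, \mathbf{E}_k, \mathbf{C}_k, \|\cdot\|_\infty$

(i.e., absolute value is replaced by maximum absolute component), for
the scheme

(8a) $\sum_{s=0}^{n} a_s \mathbf{u}_{j+s} = h\mathbf{F}\{h, x_j; \mathbf{u}_{j-m}, \ldots, \mathbf{u}_{j+n}; \mathbf{f}\} + h\boldsymbol{\rho}_{j+n},$

where $m \le j \le N - n$, with

(8b) $\mathbf{u}_0 = \mathbf{y}_0 + \boldsymbol{\rho}_0,\ \mathbf{u}_1 = \mathbf{y}_1 + \boldsymbol{\rho}_1,\ \ldots,\ \mathbf{u}_{m+n-1} = \mathbf{y}_{m+n-1} + \boldsymbol{\rho}_{m+n-1}.$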

PROBLEMS, SECTION 6 

1. Verify the error estimate corresponding to equation (5.19), as indicated
in the last sentence of Section 6, for the scheme defined by (8). That is, show that

$\|\mathbf{e}_j\|_\infty \le 2Q\left[n \max_{0 \le k \le m+n-1} \|\mathbf{e}_k\|_\infty + \frac{\rho(h) + \tau(h)}{2\kappa|a_n|}\right] \quad$ for $0 \le jh \le \frac{1}{2\kappa}$,

where

$\kappa \equiv \frac{QC(m + n + 1)}{|a_n|}$

and $C$, which appears in the vector analog of (5.3b), is a bound for the vector
norm $\|\cdot\|_\infty$ of $\mathbf{f}$ and of all its partial derivatives with respect to $x, y^{(1)}, y^{(2)}, \ldots,
y^{(r)}$ of some finite order, in the domain $S: \{(x; \mathbf{y}) \mid a \le x \le b,\ \|\mathbf{y}\|_\infty < \infty\}$.

2. Verify that if $\mathbf{u} \equiv (u^{(k)})$, $\mathbf{v} \equiv (v^{(k)})$, and $f(x; \mathbf{y})$ has a continuous derivative
with respect to all variables, then

$f(x; \mathbf{u}) - f(x; \mathbf{v}) = \sum_{k=1}^{r} (u^{(k)} - v^{(k)}) \frac{\partial f}{\partial y^{(k)}}(x; \mathbf{v} + \theta(\mathbf{u} - \mathbf{v}))$

for some $\theta$ such that $0 < \theta < 1$.

Hence, if $f$ is replaced by a vector valued function $\mathbf{f}$, each component $f^{(j)}$
of $\mathbf{f}$ may have its own $\theta_j$, satisfying $0 < \theta_j < 1$.

[Hint: Study $g(t) \equiv f(x; \mathbf{v} + t(\mathbf{u} - \mathbf{v}))$. Note that $g(t) - g(0) = tg'(\theta t)$
for some $\theta$ in $0 < \theta < 1$. Then evaluate

$\frac{d}{dt} g(t) = \sum_{k=1}^{r} (u^{(k)} - v^{(k)}) \frac{\partial f}{\partial y^{(k)}}(x; \mathbf{v} + t(\mathbf{u} - \mathbf{v}))$

and set $t = 1$.] This justifies the definition of A in (7) and J in (5b).


7. BOUNDARY VALUE AND EIGENVALUE PROBLEMS 

A boundary value problem for an ordinary differential equation (or
system) is one in which the dependent variable is required to satisfy
specified conditions at more than one point. Since an equation of $n$th
order has a general solution depending upon $n$ parameters, the total
number of boundary conditions required to determine a unique solution is,
in general, $n$. However, when the total of $n$ boundary conditions is given
at more than one point, it is possible for more than one solution to exist
or for no solution to exist. Of course, if more than $n$ conditions are imposed,
even for the initial value problem, there will, in general, be no solution.
A detailed study of the existence and uniqueness theory is beyond the scope
of our book. However, for linear problems, the theory is well known
and we shall indicate here the elements of this theory which may be
applicable to non-linear problems and to the analysis of numerical
procedures used to solve such boundary value problems.

The simplest linear boundary value problem is one in which the solution
of a second order equation, say

(1a) $y'' - p(x)y' - q(x)y = 0,$

is specified at two distinct points, say

(1b) $y(a) = \alpha, \quad y(b) = \beta.$
The solution, $y(x)$, is sought in the interval $a \le x \le b$. A formal approach
to the exact solution of the boundary value problem is obtained by
considering the related initial value problem,

(2a) $Y'' - p(x)Y' - q(x)Y = 0,$

(2b) $Y(a) = \alpha, \quad Y'(a) = s.$

The theory of solutions of such initial value problems is well known and
if, for example, the functions $p(x)$ and $q(x)$ are continuous on $[a, b]$,
the existence of a unique solution of (2) in $[a, b]$ is assured. Let us denote
this solution by

$Y = Y(s; x),$

and recall that every solution of (1a) or (2a) is a linear combination of two
particular “independent” solutions of (1a), $y^{(1)}(x)$ and $y^{(2)}(x)$, which
satisfy, say,

(3a) $y^{(1)}(a) = 1, \quad y^{(1)\prime}(a) = 0;$

(3b) $y^{(2)}(a) = 0, \quad y^{(2)\prime}(a) = 1.$

Then the unique solution of (2a) which satisfies (2b) is

(4) $Y(s; x) = \alpha y^{(1)}(x) + s y^{(2)}(x).$

Now if we take $s$ such that

(5) $Y(s; b) = \alpha y^{(1)}(b) + s y^{(2)}(b) = \beta,$

then $y(x) = Y(s; x)$ is a solution of the boundary value problem (1).
Clearly, there is at most one root of equation (5),

$s = \frac{\beta - \alpha y^{(1)}(b)}{y^{(2)}(b)},$

provided that $y^{(2)}(b) \neq 0$. If, on the other hand, $y^{(2)}(b) = 0$ there may not
be a solution of the boundary value problem (1). A solution would exist
in this case only if $\beta = \alpha y^{(1)}(b)$, but it would not be unique since then
$Y(s; x)$ of (4) is a solution for arbitrary $s$.

Thus there are two mutually exclusive cases for the linear boundary
value problem, the so-called alternative principle: either a unique solution
exists or else the homogeneous problem (i.e., $y(a) = y(b) = 0$) has a
non-trivial solution (which is $s y^{(2)}(x)$ in this example).
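
A concrete instance may make the alternative principle easier to picture. The worked example below is an illustration chosen here, not taken from the text: it uses $p(x) \equiv 0$, $q(x) \equiv -1$ in (1a), i.e. $y'' + y = 0$, with $a = 0$.

```latex
% Assumed example: y'' + y = 0 on [0, b], y(0) = \alpha, y(b) = \beta.
\begin{align*}
  y^{(1)}(x) &= \cos x, \qquad y^{(2)}(x) = \sin x
      && \text{(these satisfy (3a), (3b) at } a = 0\text{)} \\
  Y(s; b) &= \alpha \cos b + s \sin b = \beta
      && \text{(equation (5))} \\
  s &= \frac{\beta - \alpha \cos b}{\sin b}
      && \text{if } \sin b \neq 0: \text{ unique solution} \\
  b = \pi:\quad Y(s; \pi) &= -\alpha \ \text{ for every } s
      && \text{solvable only if } \beta = -\alpha, \text{ and then not uniquely.}
\end{align*}
```

For $b = \pi$ the homogeneous problem $y(0) = y(\pi) = 0$ has the non-trivial solution $s \sin x$, which is exactly the second case of the alternative.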

These observations permit us to study the solution of the inhomogeneous
equation

(6) $y'' - p(x)y' - q(x)y = r(x),$

subject to the boundary conditions (1b). This problem can be reduced to
the previous case if a particular solution of (6), say $y^{(p)}(x)$, can be found.
Then we define

(7) $w(x) \equiv y(x) - y^{(p)}(x),$

and find that $w(x)$ must satisfy the homogeneous equation (1a). The
boundary conditions for $w(x)$ become, from (7) and (1b),

$w(a) = \alpha - y^{(p)}(a) \equiv \alpha',$

$w(b) = \beta - y^{(p)}(b) \equiv \beta'.$

Thus, we can find the solution of (6) and (1b) by solving (1) with $(\alpha, \beta)$
replaced by $(\alpha', \beta')$. A definite problem for the determination of $y^{(p)}(x)$
is obtained by specifying particular initial conditions, say

(8) $y^{(p)}(a) = y^{(p)\prime}(a) = 0,$

which provides a standard type of initial value problem for the equation
(6). Again, the alternative principle holds; see Problem 1.

The formulation of boundary value problems for linear second order
equations can be easily extended to more general $n$th order equations or
equivalently to $n$th order systems of first order equations (not necessarily
linear). For example, in the latter case we may consider the system

(9a) $\mathbf{y}' = \mathbf{f}(x; \mathbf{y}),$

where we use the row vectors $\mathbf{y} \equiv (y_1, y_2, \ldots, y_n)$, $\mathbf{f} \equiv (f_1, f_2, \ldots, f_n)$, and
the functions $f_k \equiv f_k(x; \mathbf{y}) = f_k(x; y_1, \ldots, y_n)$ are functions of $n + 1$
variables. The $n$ boundary conditions may be, say,

(9b) $y_1(a) = \alpha_1,\ y_2(a) = \alpha_2,\ \ldots,\ y_{m_1}(a) = \alpha_{m_1};$

$y_{m_1+1}(b) = \beta_1,\ y_{m_1+2}(b) = \beta_2,\ \ldots,\ y_n(b) = \beta_{m_2};$

$m_1 > 0, \quad m_1 + m_2 = n, \quad m_2 > 0.$

Thus, we specify $m_1$ quantities at $x = a$ and the remaining $n - m_1 = m_2$
quantities at $x = b$.

In analogy with (2) we consider the related initial value problem:

(10a) $\mathbf{Y}' = \mathbf{f}(x; \mathbf{Y});$

(10b) $Y_i(a) = \alpha_i, \quad i = 1, 2, \ldots, m_1;$

$Y_{m_1+j}(a) = s_j, \quad j = 1, 2, \ldots, m_2.$

We indicate the dependence on the $m_2$ arbitrary parameters by writing

$Y_k = Y_k(s_1, s_2, \ldots, s_{m_2}; x), \quad k = 1, 2, \ldots, n.$

These parameters are to be determined such that

(11) $Y_{m_1+j}(s_1, s_2, \ldots, s_{m_2}; b) = \beta_j, \quad j = 1, 2, \ldots, m_2.$

This represents a system of $m_2$ equations in the $m_2$ unknowns $s_j$. In the
corresponding linear case (i.e., in which each $f_k$ is linear in all the $y_j$)
the system (11) becomes a linear system and its solvability is thus reduced
to a study of the non-singularity of a matrix of order $m_2$.

Note that the alternative principle is again valid. In the general case,
however, the roots of a transcendental system (11) are required and the
existence and uniqueness theory is more complicated (and in fact, is not
as completely developed as it is for the linear case).

We shall examine two different types of numerical methods for
approximating the solutions of boundary value problems, in Subsections 7.1 and
7.2.


7.1. Initial Value or “Shooting” Methods 

The initial value or “shooting” methods attempt to carry out numerically
the procedure indicated in equations (2) through (5). That is, roughly,
the initial data are adjusted so that the solution of an initial value problem
satisfies the required boundary condition at some distant (boundary)
point.

We take, for definiteness, a uniform net

(12) $x_0 = a, \quad x_j = x_0 + jh, \quad j = 0, 1, \ldots, N, \quad h = \frac{b - a}{N},$

and shall try to approximate thereon the solution of the linear equation (6)
subject to (1b). We first approximate the solutions $y^{(1)}(x)$ and $y^{(2)}(x)$ of the
initial value problems (1a) and (3). This can be done, for example, by
replacing (1a) by an equivalent first order system and then using a
predictor-corrector or one-step method as indicated in Section 6. In the same
manner, we can approximate the particular solution $y^{(p)}(x)$ of (6) and (8).
The respective numerical solutions are denoted at each point $x_j$ of (12)
by

(13a) $u_j^{(1)}, \quad u_j^{(2)}, \quad u_j^{(p)}, \quad j = 0, 1, \ldots, N.$

These solutions satisfy, at $x_0 = a$, the conditions

(13b) $u_0^{(1)} = 1, \quad u_0^{(2)} = 0, \quad u_0^{(p)} = 0.$

Assume that the same numerical procedure has been used to compute
each of these solutions and that we have

(14a) $e_j^{(1)} \equiv u_j^{(1)} - y^{(1)}(x_j) = \mathcal{O}(h^r),$

(14b) $e_j^{(2)} \equiv u_j^{(2)} - y^{(2)}(x_j) = \mathcal{O}(h^r),$

(14c) $e_j^{(p)} \equiv u_j^{(p)} - y^{(p)}(x_j) = \mathcal{O}(h^r).$

[That is, the truncation error of the integration scheme is $\mathcal{O}(h^r)$, and the
rounding errors are at most $\mathcal{O}(h^{r+1})$ so that the estimates (14) apply.]

The exact solution of (6) and (1b) is, by the previous analysis, given by

(15a) $y(x) = y^{(p)}(x) + \alpha y^{(1)}(x) + s y^{(2)}(x), \quad a \le x \le b;$

(15b) $s = \frac{\beta - y^{(p)}(b) - \alpha y^{(1)}(b)}{y^{(2)}(b)}.$

Of course, we assume that $y^{(2)}(b) \neq 0$. Otherwise the homogeneous
problem has a non-trivial solution and then, in general, the boundary value
problem has no solution. With the use of (13), we take for the approximate
solution the obvious combination

(16a) $U_j = u_j^{(p)} + \alpha u_j^{(1)} + s_h u_j^{(2)}, \quad j = 0, 1, \ldots, N;$

where

(16b) $s_h \equiv \frac{\beta - u_N^{(p)} - \alpha u_N^{(1)}}{u_N^{(2)}}.$

From (13b) and (16b) it clearly follows that, as required,

$U_0 = \alpha, \quad U_N = \beta,$

where we have neglected possible roundoff errors in forming $U_j$ and $s_h$.
Thus, in principle, $U_j$ is an approximate solution of the boundary value
problem (6) and (1b). In practice, we need only calculate the solution of
two initial value problems to evaluate $U_j$. That is, $y^{(p)}(x) + \alpha y^{(1)}(x)$
satisfies (6) and conditions (3a) so that $u_j^{(p)} + \alpha u_j^{(1)}$ can be computed as
the solution of a single initial value problem.
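
The procedure (13) through (16) translates almost line by line into code. The sketch below is a minimal illustration under stated assumptions: the integrator is the classical fourth order Runge-Kutta method (one possible one-step scheme, chosen here for brevity), the function names are hypothetical, and the test problem $y'' + y = x$, $y(0) = 0$, $y(1) = 2$, with exact solution $x + \sin x / \sin 1$, is an assumed example corresponding to $p \equiv 0$, $q \equiv -1$, $r(x) = x$ in (6).

```python
import math


def rk4_system(f, a, b, y0, N):
    """Classical Runge-Kutta for y' = f(x, y); returns the states at x_0, ..., x_N."""
    h = (b - a) / N
    x, y = a, list(y0)
    out = [list(y)]
    for _ in range(N):
        k1 = f(x, y)
        k2 = f(x + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
        k3 = f(x + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
        k4 = f(x + h, [yi + h * ki for yi, ki in zip(y, k3)])
        y = [yi + h / 6 * (c1 + 2 * c2 + 2 * c3 + c4)
             for yi, c1, c2, c3, c4 in zip(y, k1, k2, k3, k4)]
        x += h
        out.append(list(y))
    return out


def linear_shooting(p, q, r, a, b, alpha, beta, N):
    """Approximate y'' - p y' - q y = r, y(a) = alpha, y(b) = beta via (16a), (16b)."""
    inhom = lambda x, y: [y[1], p(x) * y[1] + q(x) * y[0] + r(x)]
    hom = lambda x, y: [y[1], p(x) * y[1] + q(x) * y[0]]
    # One initial value problem gives u^(p)_j + alpha u^(1)_j directly ...
    v = rk4_system(inhom, a, b, [alpha, 0.0], N)
    # ... and a second one gives u^(2)_j.
    u2 = rk4_system(hom, a, b, [0.0, 1.0], N)
    if u2[N][0] == 0.0:
        raise ZeroDivisionError("u_N^(2) = 0: the shooting equation (16b) is singular")
    s_h = (beta - v[N][0]) / u2[N][0]                          # (16b)
    return [v[j][0] + s_h * u2[j][0] for j in range(N + 1)]    # (16a)


N = 100
U = linear_shooting(lambda x: 0.0, lambda x: -1.0, lambda x: x, 0.0, 1.0, 0.0, 2.0, N)
exact = lambda x: x + math.sin(x) / math.sin(1.0)
err = max(abs(U[j] - exact(j / N)) for j in range(N + 1))
```

Only two initial value problems are integrated, as the text observes, and the boundary values $U_0 = \alpha$, $U_N = \beta$ are reproduced up to rounding by construction.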

Upon recalling (14), we are led to the obvious, and in fact, correct
conclusion that

$e_j \equiv U_j - y(x_j) = \mathcal{O}(h^r).$

However, as we now show, there may still be practical difficulties in
obtaining an accurate approximation.

Upon subtracting (15a) with $x = x_j$ from (16a) and using the definitions
(14), we find

(17a) $e_j = (e_j^{(p)} + \alpha e_j^{(1)} + s e_j^{(2)}) + (s_h - s) u_j^{(2)}, \quad j = 0, 1, \ldots, N.$

Since $b = x_N$, (15b) and (16b) imply $e_N = 0$ and

(17b) $(s_h - s) = -\frac{e_N^{(p)} + \alpha e_N^{(1)} + s e_N^{(2)}}{u_N^{(2)}}.$

Use (17b) in (17a) to find

(18) $e_j = (e_j^{(p)} + \alpha e_j^{(1)} + s e_j^{(2)}) - (e_N^{(p)} + \alpha e_N^{(1)} + s e_N^{(2)}) \frac{u_j^{(2)}}{u_N^{(2)}}.$

From this expression for the error we see that $e_0 = e_N = 0$ and thus the
error is, in general, small near the endpoints of the interval. However,
whenever $|u_j^{(2)}/u_N^{(2)}|$ becomes large we may expect relatively large errors.
This ratio can be computed and thus a practical assessment of the accuracy
in the present method is possible. In particular, note that $u_N^{(2)} = y^{(2)}(b) + e_N^{(2)}$.
Thus, whenever the fixed number $y^{(2)}(b)$ is small and opposite in sign
to the error $e_N^{(2)}$, which depends upon $h$, we may find a magnification of the
intermediate errors, $e_j$.

The effect of roundoff errors in the present method can be very pronounced.
While performing the calculations (16), significance is frequently
lost when large almost equal quantities are subtracted from each other.
This may be due to the occurrence of a small value of $u_N^{(2)}$, or to rapidly
growing solutions $y^{(1)}(x)$, $y^{(2)}(x)$ and/or $y^{(p)}(x)$.

By using the estimates (14) in (18) we obtain the error bound 

(19)  $|U_j - y(x_j)| = |e_j| \le M h^r\,(1 + |\alpha| + |s|)\left(1 + \left|\dfrac{u_j^{(2)}}{u_N^{(2)}}\right|\right), \qquad j = 1, 2, \ldots, N - 1.$

Thus (for sufficiently small net spacing) the error behaves as theoretically 
expected. In practice, however, it may frequently be necessary to use many 
significant figures in the calculations to realize these error estimates. 

The method (and its attendant difficulties) treated in this subsection is 
easily extended to more general linear boundary value problems. 

We can alter the procedure slightly and, with considerably more computing
effort, solve non-linear boundary value problems. For example if,
in place of (1), the problem is

(20)  $y'' = f(x, y, y'); \qquad y(a) = \alpha, \quad y(b) = \beta;$

we consider the initial value problem [in place of (2)]

(21)  $Y'' = f(x, Y, Y'); \qquad Y(a) = \alpha, \quad Y'(a) = s.$

If $Y(s; x)$ is the solution of (21) and $s^*$ is such that

(22)  $Y(s^*; b) = \beta,$

then $y(x) = Y(s^*; x)$ is a solution of (20). The equation (22) is, in general,
transcendental, whereas in the linear case the corresponding equation, (5),
is linear in $s$.

The problem of solving (20) is reduced to the determination of the root
(or roots) of (22). The root $s^*$ could be found by applying the iterative
methods of Chapter 3. Of course, in each step of such iteration schemes at
least one evaluation of $Y(s; b)$ is required for some value of $s$. This
may be found only approximately by integrating (21) numerically on
some net (12). That is, the net function $U_j(s)$, $j = 0, 1, \ldots, N$, may be
constructed by some method described in earlier sections. Then $U_j(s)$ is
an approximation to $Y(s; x_j)$. If the overall error of the integration scheme
is $O(h^r)$ and the function $f(x, y, z)$ is sufficiently smooth, then for each $s$
we will determine, in fact,

$U_N(s) = Y(s; b) + O(h^r).$

If the solutions of (21) are such that (22) has a simple root, $s^*$, and

$\dfrac{\partial Y(s; b)}{\partial s} \ne 0$

for $|s - s^*| \le \rho$, then we can show that a functional iteration procedure
will, if $s_0$ is close enough to $s^*$, produce a sequence $s_0, s_1, \ldots,$ such that
for some $k$

$U_N(s_k) - \beta = Y(s_k; b) - \beta + O(h^r) = O(h^r).$

Hence

$|s_k - s^*| = O(h^r).$

By using sufficiently many iterations, we can thus get within $O(h^r)$ of a
root of (22) and hence compute a solution of (20) to within an error bounded
by $M h^r$. In Problems 4 and 5, we indicate some of the details of these
results.

It is convenient for the application of Newton's iterative method in
solving (22) to approximate $\partial Y(s; b)/\partial s$. By differentiating (21) with
respect to $s$, we can formally find the differential equation, called the
variational equation, satisfied by the function $W(s; x) \equiv \partial Y(s; x)/\partial s$:

$W'' = \dfrac{\partial f(x, Y, Y')}{\partial y}\, W + \dfrac{\partial f(x, Y, Y')}{\partial z}\, W',$

$W(s; a) = 0, \qquad W'(s; a) = 1.$

A numerical approximation to the solution of the variational equation
may be computed stepwise along with the evaluation of $U_j(s)$. Hence for
$j = N$ we would have an approximation for both $Y(s; b)$ and $\partial Y(s; b)/\partial s$.
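
As an illustration of this combination of shooting with Newton's method, the following Python sketch integrates (21) and the variational equation together and updates $s$ by $s \leftarrow s - [Y(s;b) - \beta]/W(s;b)$. It is a sketch under assumptions: the integrator is assumed to be classical fourth order Runge-Kutta, the stopping rule and the test problem $y'' = 2y^3$, $y(0) = 1$, $y(1) = 1/2$ (exact solution $1/(1+x)$, so $s^* = -1$) are chosen only for illustration, and the names `rk4_step`, `shoot`, and `newton_shooting` are hypothetical.

```python
def rk4_step(f, x, y, h):
    """One classical RK4 step for the first order system y' = f(x, y)."""
    k1 = f(x, y)
    k2 = f(x + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(x + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(x + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (c1 + 2 * c2 + 2 * c3 + c4)
            for yi, c1, c2, c3, c4 in zip(y, k1, k2, k3, k4)]

def shoot(f, f_y, f_z, a, b, alpha, s, n):
    """Integrate (21) and the variational equation on n equal steps.
    Returns the net function U_j(s) and the approximation to dY(s; b)/ds."""
    h = (b - a) / n

    def rhs(x, v):
        Y, Yp, W, Wp = v
        return [Yp, f(x, Y, Yp), Wp, f_y(x, Y, Yp) * W + f_z(x, Y, Yp) * Wp]

    v = [alpha, s, 0.0, 1.0]  # Y(a), Y'(a), W(a), W'(a)
    U = [alpha]
    for j in range(n):
        v = rk4_step(rhs, a + j * h, v, h)
        U.append(v[0])
    return U, v[2]

def newton_shooting(f, f_y, f_z, a, b, alpha, beta, s0, n,
                    tol=1e-10, max_iter=50):
    """Newton's method for U_N(s) = beta, using the computed W at x = b."""
    s = s0
    for _ in range(max_iter):
        U, dYds = shoot(f, f_y, f_z, a, b, alpha, s, n)
        residual = U[-1] - beta
        if abs(residual) < tol:
            return s, U
        s -= residual / dYds
    raise RuntimeError("Newton iteration did not converge")

# Illustrative test problem (an assumption, not from the text).
s_star, U = newton_shooting(lambda x, y, z: 2.0 * y ** 3,
                            lambda x, y, z: 6.0 * y ** 2,
                            lambda x, y, z: 0.0,
                            0.0, 1.0, 1.0, 0.5, -0.5, 100)
```

The starting value `s0 = -0.5` is deliberately close to the root; as the text notes, convergence is only claimed when $s_0$ is close enough to $s^*$.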

7.2. Finite Difference Methods 

We consider here finite difference methods which are not based on
solving the initial value problem. These are called direct methods. The
truncation error of the particular difference method we use is $O(h^2)$ and
the labor required for a given accuracy is comparable to that for the
initial value method of some low order.

Let the boundary value problem be (6) and (1b) which we write as

(23a)  $L\{y\} \equiv y'' - p(x)\,y' - q(x)\,y = r(x);$

(23b)  $y(a) = \alpha, \qquad y(b) = \beta.$

We impose here the restriction that

(24)  $q(x) \ge Q_* > 0, \qquad a \le x \le b.$

The most simple existence and uniqueness proofs for solutions of boundary
value problems of the form (23) require such a condition but with $Q_* \ge 0$.
We assume a unique solution of (23) to exist with four continuous derivatives
in $a \le x \le b$. A uniform net will be used with $h = (b - a)/(N + 1)$.

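Now rather than seek high order accuracy in a difference approximation
of (23a) we use the simple difference equations

(25a)  $L_h\{u_j\} \equiv \dfrac{u_{j+1} - 2u_j + u_{j-1}}{h^2} - p(x_j)\,\dfrac{u_{j+1} - u_{j-1}}{2h} - q(x_j)\,u_j = r(x_j), \qquad j = 1, 2, \ldots, N.$

The boundary conditions are replaced by

(25b)  $u_0 = \alpha, \qquad u_{N+1} = \beta.$

Multiply (25a) by $-h^2/2$ to obtain

$-\dfrac{h^2}{2}\,L_h\{u_j\} = -b_j u_{j-1} + a_j u_j - c_j u_{j+1} = -\dfrac{h^2}{2}\,r(x_j), \qquad j = 1, 2, \ldots, N,$

where

(26)  $a_j \equiv 1 + \dfrac{h^2}{2}\,q(x_j), \qquad b_j \equiv \dfrac{1}{2}\left[1 + \dfrac{h}{2}\,p(x_j)\right], \qquad c_j \equiv \dfrac{1}{2}\left[1 - \dfrac{h}{2}\,p(x_j)\right].$

Using this notation the system of difference equations (25a) and boundary
conditions (25b) can be written in the vector form

(27a)  $A\mathbf{u} = \mathbf{r},$

where

(27b)  $\mathbf{u} \equiv (u_1, u_2, \ldots, u_N)^T, \qquad \mathbf{r} \equiv -\dfrac{h^2}{2}\,\bigl(r(x_1), r(x_2), \ldots, r(x_N)\bigr)^T + (b_1\alpha, 0, \ldots, 0, c_N\beta)^T,$

$A \equiv \begin{pmatrix} a_1 & -c_1 & & & \\ -b_2 & a_2 & -c_2 & & \\ & \ddots & \ddots & \ddots & \\ & & -b_{N-1} & a_{N-1} & -c_{N-1} \\ & & & -b_N & a_N \end{pmatrix}.$

Thus to solve the difference problem (25) we must, in fact, solve the $N$th
order linear system (27a) with tridiagonal coefficient matrix, $A$, given in
(27b).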

Let us require that the net spacing $h$ be so small that

(28)  $\dfrac{h}{2}\,|p(x_j)| \le 1, \qquad j = 1, 2, \ldots, N.$

Then from (26) it follows that

$|b_j| + |c_j| = b_j + c_j = 1,$

while (24) implies $a_j > 1$. So we deduce that

$|a_1| > |c_1|;$

$|a_j| > |b_j| + |c_j|, \qquad 2 \le j \le N - 1;$

$|a_N| > |b_N|;$

and hence Theorem 3.5 of Chapter 2 applies. The solution of (27a) can
thus be computed by the simple direct factorization of $A$ described in
Section 3 of Chapter 2. Of course, this furnishes a proof of the existence
of a unique solution of the difference equations (27) provided (28) is
satisfied.
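
A minimal Python sketch of this computation follows. It forms the coefficients $a_j = 1 + (h^2/2)q(x_j)$, $b_j = \tfrac12[1 + (h/2)p(x_j)]$, $c_j = \tfrac12[1 - (h/2)p(x_j)]$ and solves the tridiagonal system $-b_j u_{j-1} + a_j u_j - c_j u_{j+1} = -(h^2/2)r(x_j)$ by elimination without pivoting, which the diagonal dominance above is meant to justify. The function name `solve_bvp_fd` and the test problem are assumptions made for illustration.

```python
import math

def solve_bvp_fd(p, q, r, a, b, alpha, beta, N):
    """Difference solution of y'' - p y' - q y = r, y(a) = alpha, y(b) = beta
    on a uniform net with h = (b - a)/(N + 1)."""
    h = (b - a) / (N + 1)
    x = [a + j * h for j in range(N + 2)]
    diag_a = [1.0 + 0.5 * h * h * q(x[j]) for j in range(1, N + 1)]
    sub_b = [0.5 * (1.0 + 0.5 * h * p(x[j])) for j in range(1, N + 1)]
    sup_c = [0.5 * (1.0 - 0.5 * h * p(x[j])) for j in range(1, N + 1)]
    rhs = [-0.5 * h * h * r(x[j]) for j in range(1, N + 1)]
    rhs[0] += sub_b[0] * alpha    # known boundary value moved to the right side
    rhs[-1] += sup_c[-1] * beta
    # Forward elimination; row i reads -b_i u_{i-1} + a_i u_i - c_i u_{i+1}.
    d = diag_a[:]
    for i in range(1, N):
        m = -sub_b[i] / d[i - 1]
        d[i] -= m * (-sup_c[i - 1])
        rhs[i] -= m * rhs[i - 1]
    # Back substitution.
    u = [0.0] * N
    u[-1] = rhs[-1] / d[-1]
    for i in range(N - 2, -1, -1):
        u[i] = (rhs[i] + sup_c[i] * u[i + 1]) / d[i]
    return x, [alpha] + u + [beta]

# Illustrative test problem (an assumption): y'' - y = 0, y(0) = 1, y(1) = e,
# so p = 0, q = 1 (hence Q_* = 1), r = 0 and the exact solution is exp(x).
x, u = solve_bvp_fd(lambda t: 0.0, lambda t: 1.0, lambda t: 0.0,
                    0.0, 1.0, 1.0, math.e, 49)
max_err = max(abs(uj - math.exp(xj)) for xj, uj in zip(x, u))
```

For this test problem the maximum error can be compared with the $O(h^2)$ estimate of Theorem 1, taking $h = 0.02$, $M_4 = e$, $P^* = 0$ and $Q_* = 1$.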

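Let us now estimate the error in the numerical approximation defined
above. The local truncation errors, $\tau_j$, are defined by:

(29)  $L_h\{y(x_j)\} \equiv r(x_j) + \tau_j, \qquad j = 1, 2, \ldots, N.$

Since $y(x)$ is a solution of (1a) we have, assuming $y^{iv}(x)$ to be continuous,

(30)  $\tau_j = L_h\{y(x_j)\} - L\{y(x_j)\}$

$\qquad = \left[\dfrac{y(x_j - h) - 2y(x_j) + y(x_j + h)}{h^2} - y''(x_j)\right] - p(x_j)\left[\dfrac{y(x_j + h) - y(x_j - h)}{2h} - y'(x_j)\right]$

$\qquad = \dfrac{h^2}{12}\left[y^{iv}(\xi_j) - 2p(x_j)\,y'''(\eta_j)\right], \qquad j = 1, 2, \ldots, N.$

Here $\xi_j$ and $\eta_j$ are in $[x_{j-1}, x_{j+1}]$ and we have used Taylor's theorem.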

The basic error estimate can now be stated as

THEOREM 1. If the net spacing, $h$, satisfies (28) then

(31a)  $|u_j - y(x_j)| \le h^2\left(\dfrac{M_4 + 2P^* M_3}{12\,Q_*}\right), \qquad j = 0, 1, \ldots, N + 1;$

where $y(x)$ is the solution of (6) and (1b), $\{u_j\}$ is the solution of (25) and

(31b)  $P^* \equiv \max_{[a,b]} |p(x)|, \qquad M_3 \equiv \max_{[a,b]} |y'''(x)|, \qquad M_4 \equiv \max_{[a,b]} |y^{iv}(x)|.$

Proof. Let us define

$e_j \equiv u_j - y(x_j), \qquad j = 0, 1, \ldots, N + 1.$

Then subtracting (29) from (25a) yields, with the aid of (26),

(32)  $a_j e_j = b_j e_{j-1} + c_j e_{j+1} + \dfrac{h^2}{2}\,\tau_j, \qquad j = 1, 2, \ldots, N.$

Now with the norms

$e \equiv \max_{0 \le j \le N+1} |e_j|, \qquad \tau \equiv \max_{1 \le j \le N} |\tau_j|,$

we obtain by taking absolute values in (32) and using the equation after (28)

$|a_j e_j| \le e + \dfrac{h^2}{2}\,\tau, \qquad j = 1, 2, \ldots, N.$

However, by condition (24), $|a_j| = a_j \ge 1 + (h^2/2)Q_*$ and so the above
implies that

$\left(1 + \dfrac{h^2}{2}\,Q_*\right)|e_j| \le e + \dfrac{h^2}{2}\,\tau, \qquad j = 1, 2, \ldots, N.$

From (23b) and (25b) we have $e_0 = e_{N+1} = 0$ and so the above inequality
is valid for all $j$ in $0 \le j \le N + 1$. Thus we conclude that

$e \le \dfrac{1}{Q_*}\,\tau.$

Finally, by using the quantities (31b) in (30) we find that

$\tau \le \dfrac{h^2}{12}\,(M_4 + 2P^* M_3)$

and the theorem follows. ■


From Theorem 1 we see that the difference solution converges to the
exact solution as $h \to 0$ and, in fact, that the error is at most $O(h^2)$. For
equations in which $p(x) \equiv 0$, error bounds that are $O(h^4)$ are easily obtained
by using a slight modification of (25a) (see Problem 6). Boundary conditions
more general than those in (23b) can be treated with no essential change in
these results (see Problem 7). The condition (24) can be relaxed to $q(x) \ge 0$
and a somewhat more involved argument yields a result analogous to
that of Theorem 1. These arguments are based on a so-called maximum
principle (see Problem 9). The use of this maximum principle is demonstrated
in Problem 10.

The effects of roundoff in computing the solution of (25) can be estimated.
In fact, let $U_j$ be the computed quantities which, in place of (25),
satisfy

(33a)  $-\dfrac{h^2}{2}\,L_h\{U_j\} = -\dfrac{h^2}{2}\,r(x_j) + \rho_j, \qquad j = 1, 2, \ldots, N;$

and the boundary conditions

(33b)  $U_0 = \alpha + \rho_0, \qquad U_{N+1} = \beta + \rho_{N+1}.$

The quantities $\rho_j$ represent the local roundoff errors committed in each
of the indicated computations. Now we define

$E_j \equiv U_j - y(x_j), \qquad j = 0, 1, \ldots, N + 1,$

and exactly as in the proof of Theorem 1 we deduce that

$\left(1 + \dfrac{h^2}{2}\,Q_*\right)|E_j| \le E + \dfrac{h^2}{2}\,\tau + \rho, \qquad j = 1, 2, \ldots, N.$

Here

$\rho \equiv \max_{0 \le j \le N+1} |\rho_j|, \qquad E \equiv \max_{0 \le j \le N+1} |E_j|,$

and $|E_0| = |\rho_0|$, $|E_{N+1}| = |\rho_{N+1}|$.

If, in addition to (28), we require that $h^2 Q_*/2 \le 1$, then this inequality
is also valid for $j = 0$ and $j = N + 1$ so we finally obtain

$E \le \dfrac{1}{Q_*}\left(\tau + \dfrac{2}{h^2}\,\rho\right).$

Thus for sufficiently small net spacing, $h$, we have

(34)  $|U_j - y(x_j)| \le h^2\left(\dfrac{M_4 + 2P^* M_3}{12\,Q_*}\right) + \dfrac{1}{Q_*}\,\dfrac{2\rho}{h^2}, \qquad j = 0, 1, \ldots, N + 1.$

The roundoff affects this estimate somewhat differently than it did
the corresponding estimates in Subsections 1.2 and 1.3, etc. Now to have an
error bound which is $O(h^2)$ we must limit the roundoff by $\rho = O(h^4)$
as $h \to 0$. That is, two orders in $h$ improvement over the local truncation
error are required. Previously, only one additional order in $h$ was required,
since our difference equations were then approximations to first order
differential equations (or systems of equations).

Difference methods can also be applied to fairly general non-linear
second order boundary value problems. While such methods accurate to
$O(h^2)$ can be determined, the difference equations are now no longer
linear. Hence iterations are employed to solve these equations. It should
be observed that the iterations are not employed in order to satisfy the
correct boundary conditions, as would be the case in the initial value
methods. The construction of iteration procedures for solving the difference
equations is quite simple.

The non-linear boundary value problems we consider are of the form
(20) where the function $f(x, y, z)$ is assumed to satisfy the conditions

(35)  $0 < Q_* \le \dfrac{\partial f(x, y, z)}{\partial y} \le Q^*, \qquad \left|\dfrac{\partial f(x, y, z)}{\partial z}\right| \le P^*;$

in some sufficiently large region. Furthermore, these partial derivatives
and $y^{iv}(x)$ are assumed to be continuous.
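Again we use a uniform net. On this net the difference approximation
of (20) is taken to be

(36a)  $\dfrac{u_{j-1} - 2u_j + u_{j+1}}{h^2} = f\!\left(x_j,\, u_j,\, \dfrac{u_{j+1} - u_{j-1}}{2h}\right), \qquad j = 1, 2, \ldots, N;$

(36b)  $u_0 = \alpha, \qquad u_{N+1} = \beta.$

The local truncation error, $\tau_j$, of this method is defined in the usual manner
by

(37)  $\dfrac{y_{j-1} - 2y_j + y_{j+1}}{h^2} = f\!\left(x_j,\, y_j,\, \dfrac{y_{j+1} - y_{j-1}}{2h}\right) + \tau_j, \qquad j = 1, 2, \ldots, N,$

where $y_j \equiv y(x_j)$.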

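From the assumed continuity properties of $\partial f/\partial z$ and $y^{iv}(x)$ it follows
that

(38)  $\tau_j = \dfrac{h^2}{12}\left[y^{iv}(\xi_j) - 2\,y'''(\eta_j)\,\dfrac{\partial f}{\partial z}\right], \qquad j = 1, 2, \ldots, N.$

Here, $\xi_j$ and $\eta_j$ in $[x_{j-1}, x_{j+1}]$ are the appropriate mean values used in
Taylor's theorem, and $\partial f/\partial z$ is evaluated at $x_j$, $y(x_j)$ and an intermediate
value of its third argument.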

To examine the convergence of this procedure we introduce $e_j \equiv u_j - y(x_j)$
and, for the further applications of Taylor's theorem,

(39)  $p_j \equiv \dfrac{\partial f}{\partial z}\!\left(x_j,\; y(x_j) + \theta_j e_j,\; \dfrac{y(x_{j+1}) - y(x_{j-1})}{2h} + \theta_j\,\dfrac{e_{j+1} - e_{j-1}}{2h}\right),$

$\qquad q_j \equiv \dfrac{\partial f}{\partial y}\!\left(x_j,\; y(x_j) + \theta_j e_j,\; \dfrac{y(x_{j+1}) - y(x_{j-1})}{2h} + \theta_j\,\dfrac{e_{j+1} - e_{j-1}}{2h}\right), \qquad 0 < \theta_j < 1.$

Then subtracting (37) from (36a) we get, with the above notation and
appropriate values for the $\theta_j$,

(40)  $\left(1 + \dfrac{h^2}{2}\,q_j\right)e_j = \dfrac{1}{2}\left(1 + \dfrac{h}{2}\,p_j\right)e_{j-1} + \dfrac{1}{2}\left(1 - \dfrac{h}{2}\,p_j\right)e_{j+1} + \dfrac{h^2}{2}\,\tau_j, \qquad j = 1, 2, \ldots, N.$

This system of equations is formally identical to that in (32), with the same
boundary conditions, $e_0 = e_{N+1} = 0$. So we may conclude, by using (35)
in place of (24), exactly as in the proof of Theorem 1 that

(41)  $|e_j| \le h^2\left(\dfrac{M_4 + 2P^* M_3}{12\,Q_*}\right).$

Here $P^*$ is defined in (35) and $M_3$ and $M_4$ are the appropriate bounds on
the derivatives of the solution of (20). Thus the order of convergence for
the non-linear problem is the same as that for the linear case; the constants
in (41) have only slightly different meanings from those in (31). The non-linear
cases for which the difference method is applicable can be generalized
as are the linear cases in Problems 7 and 8.

If $f(x, y, z)$ is not a linear function of $y$ and $z$, then the difference equations
(36) constitute a non-linear system of equations. The general methods
of Chapter 3 could be applied in order to solve such systems. In particular,
Newton's method is frequently well suited for this purpose, and in special
cases the convergence proof given in Subsection 3.2 of Chapter 3 can be
applied. However, due to the special structure of this system some other
iteration schemes are naturally suggested, and we shall consider one of
them here. All of these methods proceed from an initial estimate of the
solution, say

$u_j^{(0)}, \quad j = 1, 2, \ldots, N; \qquad u_0^{(0)} = \alpha, \quad u_{N+1}^{(0)} = \beta.$

A particularly simple iteration scheme for solving (36) is defined by:

(42a)  $(1 + \omega)\,u_j^{(\nu+1)} = \dfrac{1}{2}\left(u_{j-1}^{(\nu)} + u_{j+1}^{(\nu)}\right) + \omega\,u_j^{(\nu)} - \dfrac{h^2}{2}\,f\!\left(x_j,\, u_j^{(\nu)},\, \dfrac{u_{j+1}^{(\nu)} - u_{j-1}^{(\nu)}}{2h}\right), \qquad j = 1, 2, \ldots, N;$

(42b)  $u_0^{(\nu+1)} = \alpha, \qquad u_{N+1}^{(\nu+1)} = \beta.$

Here $\omega$ is a parameter to be determined so that the iterates converge.
In fact, we can show, see Problem 11, that if $\omega$ satisfies


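(42c)  $\omega \ge \dfrac{h^2}{2}\,Q^*, \qquad \dfrac{h}{2}\,P^* \le 1,$

then the iterates satisfy

(43)  $|u_j^{(\nu+1)} - u_j^{(\nu)}| \le \left(1 - \dfrac{h^2 Q_*/2}{1 + \omega}\right)\max_{i} |u_i^{(\nu)} - u_i^{(\nu-1)}|, \qquad j = 1, 2, \ldots, N.$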

From this result we see that the iterates form a Cauchy sequence. 
Thus not only do they converge but by the assumed continuity of $f(x, y, z)$
we can show, exactly as in the proof of Theorem 1.1 in Chapter 3, that a 
unique solution of the difference equations (36) exists. 
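
The iteration (42) is simple enough to sketch directly. The Python fragment below is an illustration under assumptions, not the book's procedure in full: the starting estimate is taken to be the straight line joining the boundary values, the iteration is stopped when successive iterates differ by less than a tolerance, and the parameter is taken as $\omega = h^2 Q^*/2$ with $Q^*$ an upper bound for $\partial f/\partial y$ on the range of values visited. The function name `iterate_42` and the test problem are hypothetical.

```python
def iterate_42(f, a, b, alpha, beta, N, omega, tol=1e-12, max_sweeps=100000):
    """Iterate (1 + omega) u_j(new) = (u_{j-1} + u_{j+1})/2 + omega u_j
    - (h^2/2) f(x_j, u_j, (u_{j+1} - u_{j-1})/(2h)), with fixed end values."""
    h = (b - a) / (N + 1)
    x = [a + j * h for j in range(N + 2)]
    # Assumed initial estimate: linear interpolation of the boundary data.
    u = [alpha + (beta - alpha) * j / (N + 1) for j in range(N + 2)]
    for sweep in range(1, max_sweeps + 1):
        new = u[:]                      # end values u_0, u_{N+1} stay fixed
        for j in range(1, N + 1):
            slope = (u[j + 1] - u[j - 1]) / (2.0 * h)
            new[j] = (0.5 * (u[j - 1] + u[j + 1]) + omega * u[j]
                      - 0.5 * h * h * f(x[j], u[j], slope)) / (1.0 + omega)
        change = max(abs(nj - uj) for nj, uj in zip(new, u))
        u = new
        if change < tol:
            return x, u, sweep
    raise RuntimeError("iteration did not converge")

# Illustrative test problem (an assumption): y'' = 2 y^3, y(0) = 1, y(1) = 1/2,
# exact solution 1/(1 + x); df/dy = 6 y^2 <= 6 for 0 <= y <= 1, so Q^* = 6.
N = 9
h = 1.0 / (N + 1)
x, u, sweeps = iterate_42(lambda t, y, z: 2.0 * y ** 3,
                          0.0, 1.0, 1.0, 0.5, N, omega=0.5 * h * h * 6.0)
max_err = max(abs(uj - 1.0 / (1.0 + xj)) for xj, uj in zip(x, u))
```

Every new value depends only on the previous iterate, so the sweep order is immaterial; the price of this simplicity is slow, linear convergence.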

7.3. Eigenvalue Problems 

We have shown previously that a linear boundary value problem may
have non-unique solutions. In fact, this occurs if and only if the corresponding
homogeneous boundary value problem has a non-trivial solution.
If the coefficients of the homogeneous equation depend upon some
parameter it is frequently of interest to determine the values of the parameter
for which such non-trivial solutions exist. These special parameter
values are called eigenvalues and the corresponding non-trivial solutions
are called eigenfunctions. The simplest example is furnished by the
homogeneous problem

$y'' + \lambda y = 0; \qquad y(a) = y(b) = 0.$

For each of the parameter values

$\lambda = \lambda_n \equiv \left(\dfrac{n\pi}{b - a}\right)^2, \qquad n = 1, 2, \ldots,$

there exists a non-trivial solution

$y(x) = y_n(x) \equiv \sin\left[\sqrt{\lambda_n}\,(x - a)\right], \qquad n = 1, 2, \ldots.$

A fairly general class of eigen-problems, which includes many of the
cases that occur in applied mathematics, are the Sturm-Liouville problems,

(44a)  $L\{y\} + \lambda r(x)y \equiv [p(x)y']' - q(x)y + \lambda r(x)y = 0,$

(44b)  $\alpha_0 y'(a) - \alpha_1 y(a) = 0, \qquad \beta_0 y'(b) + \beta_1 y(b) = 0.$

Here $p(x) > 0$, $r(x) > 0$, and $q(x) \ge 0$; $p'(x)$, $q(x)$, and $r(x)$ are continuous
on $[a, b]$; and the constants $\alpha_\nu$ and $\beta_\nu$ are non-negative and at least one of
each pair does not vanish. It is known that for such problems there exists
an infinite sequence of non-negative eigenvalues

(45)  $0 \le \lambda_1 < \lambda_2 < \lambda_3 < \cdots.$

In addition, there exist corresponding eigenfunctions, $y_n(x)$, which satisfy
the orthogonality relations

$\displaystyle\int_a^b y_n(x)\,y_m(x)\,r(x)\,dx = \delta_{nm},$

and the $n$th eigenfunction has $n - 1$ distinct zeros in $a < x < b$.

We may again relate the solution of (44) to an initial value problem.
For any fixed $\lambda$ we consider

(46a)  $L\{Y\} + \lambda r(x)Y = 0;$

(46b)  $\alpha_0 Y'(a) - \alpha_1 Y(a) = 0, \qquad \gamma_0 Y'(a) - \gamma_1 Y(a) = 1.$

Here $\gamma_0$ and $\gamma_1$ are any constants such that $(\alpha_1\gamma_0 - \alpha_0\gamma_1) \ne 0$. Then the
two initial conditions in (46b) are linearly independent and a unique
non-trivial solution of the initial value problem (46) exists. We denote this
solution by $Y(\lambda; x)$. Now we consider the equation

(47)  $\Phi(\lambda) \equiv \beta_0 Y'(\lambda; b) + \beta_1 Y(\lambda; b) = 0.$

Clearly, each eigenvalue $\lambda_n$ in (45) must satisfy this equation. Also every
zero, $\lambda^*$, of $\Phi(\lambda)$ is an eigenvalue of (44) and the corresponding solution
$Y(\lambda^*; x)$ of (46) is a corresponding eigenfunction of (44). Note that the
present analysis differs from the corresponding discussion at the beginning
of Section 7. Here, a parameter in the equation must be adjusted while
the adjoined initial condition remains fixed, which reverses the previous
situation. Of course, the present considerations apply to eigenvalue
problems more general than those in (44); say for instance to problems
in which the eigenvalue parameter $\lambda$ enters into all of the coefficients of
the equation and the boundary conditions. Extensions to homogeneous
systems of, say, $m$ second-order equations with $m$ parameters are also
clearly suggested. The initial value procedure can actually be used to
prove the existence of the eigenvalues (45) and the oscillation properties
of the eigenfunctions.
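
To make the role of $\Phi(\lambda)$ concrete, here is a small Python sketch for the simplest case $y'' + \lambda y = 0$, $y(0) = y(1) = 0$, for which the initial conditions reduce to $Y(0) = 0$, $Y'(0) = 1$ and $\Phi(\lambda) = Y(\lambda; 1)$. It is only an illustration: the integrator (classical fourth order Runge-Kutta), the use of bisection to locate a zero of $\Phi$, the bracketing interval, and the names `phi` and `bisect` are all assumptions, not prescriptions of the text.

```python
import math

def phi(lam, n=200):
    """Phi(lam) = Y(lam; 1), where Y'' + lam*Y = 0, Y(0) = 0, Y'(0) = 1,
    approximated by n classical RK4 steps on [0, 1]."""
    h = 1.0 / n
    y, z = 0.0, 1.0                      # y = Y, z = Y'
    for _ in range(n):
        k1y, k1z = z, -lam * y
        k2y, k2z = z + h / 2 * k1z, -lam * (y + h / 2 * k1y)
        k3y, k3z = z + h / 2 * k2z, -lam * (y + h / 2 * k2y)
        k4y, k4z = z + h * k3z, -lam * (y + h * k3y)
        y, z = (y + h / 6 * (k1y + 2 * k2y + 2 * k3y + k4y),
                z + h / 6 * (k1z + 2 * k2z + 2 * k3z + k4z))
    return y

def bisect(g, lo, hi, iters=60):
    """Locate a sign change of g in [lo, hi] by repeated halving."""
    g_lo = g(lo)
    if g_lo * g(hi) > 0:
        raise ValueError("g(lo) and g(hi) must differ in sign")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if g_lo * g_mid <= 0:
            hi = mid
        else:
            lo, g_lo = mid, g_mid
    return 0.5 * (lo + hi)

# The smallest eigenvalue is pi^2; the bracket [5, 15] is an assumption
# chosen so that Phi changes sign exactly once inside it.
lam1 = bisect(phi, 5.0, 15.0)
```

Each evaluation of `phi` costs one full integration of the initial value problem, which is why a faster root finder such as Newton's method (see Problem 16) may be preferred in practice.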

To approximate the eigenvalues and eigenfunctions for problems of the
form (44), and various generalizations of these problems, we may apply
numerical methods which are exactly analogous to those used in
Subsections 7.1 and 7.2. However, the proofs of convergence and estimates
of the errors are now not always as easy to obtain as they were for those
boundary value problems.

Some approximation methods for eigenvalue problems are based on
variational principles. These have led to the construction of useful numerical
methods. However, we do not treat them here, but refer the reader to
the brief discussion of variational principles in Subsection 1.2 of Chapter 9.
A simple application of the basic error estimate for an eigenvalue of a
symmetric matrix, Theorem 1.5 of Chapter 4, can be used to give an
error estimate for the eigenvalue of a differential equation that is approximated
by a difference method (e.g., the method in Subsection 7.2). Consider
the eigenvalue problem

(48)  $L\{y\} = \lambda y; \qquad y(a) = y(b) = 0,$

where $L\{\cdot\}$ is defined in (44a).

Assume that $\lambda$ is an eigenvalue and $y(x)$ a corresponding eigenfunction,
with a continuous fourth derivative. Let

(49)  $L_h\{u\} = \Lambda u; \qquad u(a) = u(b) = 0,$

be a finite difference approximation to (48), on the net (12). Assume that
the matrix form of (49), analogous to (27), is

(50)  $A\mathbf{u} = -\dfrac{h^2}{2}\,\Lambda\,\mathbf{u},$

where $A$ is a symmetric matrix. Then the truncation error, $\boldsymbol{\tau}$, of the
eigensolution is defined by

$A\mathbf{y} + \dfrac{h^2}{2}\,\lambda\,\mathbf{y} \equiv \boldsymbol{\tau}.$

If $\|\boldsymbol{\tau}\|_\infty \le M h^4/2$, when $\|\mathbf{y}\|_\infty = 1$, then $\|\boldsymbol{\tau}\|_2 \le M N^{1/2} h^4/2$. Furthermore,
Theorem 1.5 of Chapter 4 implies

$\min_{1 \le j \le N} \dfrac{h^2}{2}\,|\Lambda_j - \lambda| \le \dfrac{h^4}{2}\,M N^{1/2},$

whence we have shown,

THEOREM 2.  $\min_{1 \le j \le N} |\Lambda_j - \lambda| \le h^2 M N^{1/2} = O(h^{3/2}).$ ■

Theorem 2 states that some eigenvalue, $\Lambda_j$, of the discrete problem (49)
is a good approximation to a given eigenvalue $\lambda$ of (48). But as $h \to 0$,
the theorem fails to identify which eigenvalue $\Lambda_j$ is the closest approximation.
In Problem 14, we verify that, in a special case, the smallest eigenvalues
$\Lambda_j$ approximate respectively the lowest eigenvalues $\lambda_j$.


PROBLEMS, SECTION 7 

1. Establish the alternative principle. Either the equations (6) and (1b) have
a unique solution or else the homogeneous problem [i.e., $r(x) \equiv 0$, $\alpha = \beta = 0$]
has a non-trivial solution.
2. Solve by the initial value method

$y'' = -100y; \qquad y(0) = 1, \quad y(2\pi + \epsilon) = 1.$

Use

$y^{(1)}(x) = \cos 10x, \qquad y^{(2)}(x) = \dfrac{\sin 10x}{10}.$

For small $\epsilon$, show that $s = 50\epsilon + O(\epsilon^3)$. Explain why the computational scheme
corresponding to the initial value method would be difficult to apply for
small $\epsilon$.

3. Solve by the initial value method

$y'' = 100y; \qquad y(0) = 1, \quad y(3) = e^{-30}.$

Use

$y^{(1)}(x) = \dfrac{e^{10x} + e^{-10x}}{2}, \qquad y^{(2)}(x) = \dfrac{e^{10x} - e^{-10x}}{20}.$

Explain why the computational scheme of the initial value method would
have to be applied with great care.

4. The chord method for approximating the root $s^*$ of (22) is based on the
iteration scheme

$s_{k+1} = g(s_k),$

where

$g(s) \equiv s - m[Y(s; b) - \beta].$

Show that if for some $\rho > 0$,

$0 < L \le \left|\dfrac{\partial Y}{\partial s}(s; b)\right| \le K, \qquad \text{for } |s - s^*| \le \rho,$

then with


. (dY(s;b)\ 

i< slgn (—)’ 


L + 


. K - L , 

-TTl k l 


5. Let the approximate solution of (21) be V^s), 0 < j < N; and assume 
that, in the notation of Problem 4, 

m\ U N (s) - Y(s; b)\ < 8 = for \s - 5 *| < p . 

Define A = (A: - L)/(/C + L) and let h be small enough so that 8 < (1 - A)p/2. 
Use Theorem 1.3 of Chapter 3 and Problem 4 to show that, with <r k + 1 = 
- m[£/*(a k ) - 0], then 

1^ - **l £ i4a + Ak ( p “ T^a)’ 

if 

g 

|ct 0 - 5*1 < P - 
6. For the boundary value problem: $y'' - q(x)y = r(x)$; $y(a) = y(b) = 0$
use a difference scheme of the form:


[u j + 1 - 2Uj + Uj-i]/h 2 - [<xiq(x j + 1 )u j + 1 + a 

+ a-i^(-Ty-i)Wy-i] = [«ir(x y + i) + a 0 r(x,) + 
for y = 1, 2,..TV, with w 0 — + i = 0 (as usual h = (b — a)/(N + 1)). 

(a) Determine a 0 , such that the truncation error is 0{h A ). We 

assume here that y v , q [v y and r lv are continuous. Note that for the solution 
>>(*) we have y iv - [q(x)y\ f ~ r"(x). 

(b) If q(x) > Q+ > 0, then show that for sufficiently small h: 


where = 


max 

la. b] 


I u, - y(x,)\ < 

I/"Ml, N t * 


h 1 2 M„ + 5 N, + 5R t 


720 


max 

[a, b] 


Q* 

\[q(x)y(x)r\ y R, m 


The proof is just as in Theorem 1. 

7. Consider the boundary value problem 


max |r iv (;c)|. 

la, b] 


y" - p(x)y' - q(x)y = r(x); a 0 y\a) - aiy(a) = a, 

Po/(b) + p !y(b) = p 

where a 0 , /? 0 , and are all positive. Use the difference equations 

u j + 1 - 2u j + u j . 1 u J + 1 - u j - 1 . . , v 

- F- ” /<**) “-j h - = 

for j = 0, 1 ,..N + 1 

and the “boundary” conditions 

/«1 — W_i\ 0 /w^ + 2 — Q Q 

«ol- 2/2 ) “ 0£lW ° = Pol- 2 h -) “ P lWw + 1 = P- 

[Note: Values at x-^ = a — h and x N + 2 — b + h have been introduced 
and the difference approximations of the differential equation have been 
employed at x Q - a and at x N + 1 = b. Hence the values x_i and x N + 2 can 
be eliminated from the above difference equations,] 

(a) Write these difference equations as a system of order N + 2 in the form 
(27). If the tridiagonal coefficient matrix is 


A = 



with j = 0, 1,../V + 1, show that from (26), 

Aj = a h Bj = b h Cj = Cj for y = 1, 2,.. N 

and that 

Aq = {qq t 2 h — b 0 \> Cq = (co + bo). 


Find similar expressions for A N + 1 and B N + 1 . 
(b) If $q(x) \ge Q_* > 0$ and the solution $y(x)$ is sufficiently smooth in an
open interval containing $[a, b]$ show that for sufficiently small $h$

$|e_j| \le O(h^2), \qquad j = 0, 1, \ldots, N + 1.$

8.* Consider the boundary value problem: 

[a(x)y']' - p(x)y' - q(x)y = r(x); y{a) = y(b) = 0 

and the corresponding difference problem: 


r / a 

\(u l + 1 - Uj\ [_ h 

\{Uj - 

H X# + 2 

)\ h 2 / ( ' 2 

A h 2 )J 


- 2h ~ * ) “ <A x i) u i 


r(x,)> 


j = 1, 2,..., N\ u 0 = u N +! = 0. 


(a) If / v and a" 1 are continuous, show that the truncation error in this scheme 
is 0(h 2 ). 

(b) If q(x) > Q* > 0 and A * > o(x) > A* > 0, show that 


k - ^ 


A* 


provided A+ - {hl2)\p{x } )\ > 0 for j = 1, 2, . , N 

[Hint: Proceed as in the proof of Theorem 1 but now divide by \b s | + \cj\ = 
bj + Cj > 2A+ before bounding the coefficients.] 

9. We define the difference operator $T$ by

$T u_j \equiv a_j u_j - b_j u_{j-1} - c_j u_{j+1}, \qquad j = 1, 2, \ldots, N,$

where:

$b_j > 0, \qquad c_j > 0, \qquad a_j \ge b_j + c_j.$

Prove the maximum principle: Let the net function $\{V_j\}$ satisfy $T V_j \le 0$, $j = 1, 2, \ldots, N$.
Then

$\max_{0 \le j \le N+1} V_j = \max\{V_0, V_{N+1}\}.$

Conversely if $T V_j \ge 0$, $j = 1, 2, \ldots, N$; then

$\min_{0 \le j \le N+1} V_j = \min\{V_0, V_{N+1}\}.$

[Hint: Use contradiction; assume $\max V_j \equiv M$ is at $V_k$ for some $k$ in
$1 \le k \le N$ but that $V_0 \ne M$ and $V_{N+1} \ne M$. Then conclude that $V_j = M$
for all $j$ which is a contradiction. The minimum result follows by changing
sign.]

[Note: The conditions on the coefficients in $T$ are satisfied by the quantities
in (26) provided (28) is satisfied even if we allow $q(x) = 0$ (i.e., if condition
(24) is weakened).]
10. Let $T$ be as in Problem 9 and $\{e_j\}$ satisfy

$T e_j = \sigma_j, \qquad j = 1, 2, \ldots, N.$

Suppose $\{g_j\}$ satisfies $g_j > 0$ and

$T g_j \ge 1.$

Then prove that

$|e_j| \le \max_{\nu = 0,\, N+1}\left(|e_\nu| + \sigma g_\nu\right) + \sigma g,$

where $\sigma \equiv \max |\sigma_j|$, $g \equiv \max |g_j|$.

[Hint: Form $\omega_j = e_j \pm \sigma g_j$ and apply the maximum principle.]

11.* Prove that (43) follows from (42).

[Hint: Subtract (42a) from the corresponding equation with $\nu + 1$ replaced
by $\nu$; use Taylor's theorem and proceed as in the derivation of (41).]

12. * Consider, in place of (36), the difference equations u 0 — u N + 1 = 0; 


U ) + i - 2i/y + Wy. 

h 2 


1 = f{*u 




+ w, 




Wy_ 


2^ 




j = 1,2,...,M 


(a) Show that |«y — y(Xy)| = 0(/i 2 ) where y(x) is the solution of (20); 
(28) and (35) hold; and y lv (;e), df/dy and df/dy' are continuous. 

(b) Under the above assumptions and h(P * + hQ*)/2 < 1, prove con¬ 
vergence of the iterations: 

,/v +1) _ ..(v + 1) _ rj 

«o — +1 — 


= + «$-i] - ?/i 




+ «$-i - «$ v ii 

2 ’ 2 h 


)• 


y = 1,2,..., m 


Note that the parameter $\omega$ is not required here, as it was in (42); i.e., we
could employ the value $\omega = 0$.

13. Solve for the eigenvalues and eigenvectors of the problem $y'' + \lambda y = 0$,
$y'(a) = y'(b) = 0$, by using the initial value technique. For example, use the
initial values $y'(a) = 0$, $y(a) = \text{constant} \ne 0$.

14. Find the eigenvalues and eigenvectors of the scheme

$\dfrac{u_{j-1} - 2u_j + u_{j+1}}{h^2} = \Lambda u_j, \qquad 1 \le j \le N,$

$u_0 = u_{N+1} = 0, \qquad h = \dfrac{\pi}{N + 1}.$

Compare them with the eigenvalues and eigenvectors of
$y'' = \lambda y$, $y(0) = y(\pi) = 0$.

[Hint: Solve the difference equation in the form = a\ for an appropriate 
a. Show that the eigenfunctions are 


< , = ^ sin/ (rn)’ 

for k = 1,2,..., TV.] 


0 < j < N + 1, 
15. (a) Use the results of Problems 6 and 14 to devise a difference
scheme for $y'' + \lambda y = 0$, $y(0) = y(1) = 0$ which yields $O(h^4)$ approximations
to the eigenvalues.

(b) Find the eigenvalues of the difference scheme and verify directly,
with a comparison to $\lambda_n = n^2\pi^2$, that they are actually $O(h^4)$. How accurate
are the eigenvectors?

16.* Derive a variational differential equation that is satisfied by $\partial Y(\lambda; x)/\partial\lambda$,
where $Y$ is a solution of (46). Describe how Newton's iterative method might
be formulated to solve for an eigenvalue $\lambda$ from (47).




9 

Difference Methods 
for Partial Differential 
Equations 


0. INTRODUCTION 

Although considerable study has been made of partial differential equation
problems, the mathematical theory (existence, uniqueness, or well-posedness)
is not nearly as complete as it is in the case of ordinary differential
equations. Furthermore, except for some problems that are solved
by explicit formulae, the analytical methods developed for the treatment
of partial differential equations are, in general, not suited for the efficient
numerical evaluation of solutions. Hence, as may be expected, the theory
of numerical methods for partial differential equations is somewhat
fragmented. Where the theory of the differential equations is well developed
there has been a corresponding development of numerical methods. But
the difference methods found thus far usually do not permit the construction
of schemes of an arbitrarily high order of accuracy. For certain systems
of partial differential equations convergent numerical methods of arbitrarily
high order of accuracy have been devised (for instance, linear first order
hyperbolic systems in two unknowns); while for others (say the simple
case of the Laplace equation on a square) only relatively low order methods
have been proved to converge. Furthermore, in contrast to the case of the
numerical solution of ordinary differential equation problems, the facility
with which one may use difference methods on modern electronic computers
to solve problems involving partial differential equations is severely
limited by (a) size of the high speed memory, (b) speed of the arithmetic
unit, and (c) difficulty of programming a problem for and communicating
with the computer.


In view of the limitations of the scope of this book and of the incompleteness
of the theory of difference methods, we shall illuminate some of the
highlights of this theory through the treatment of problems for the

(1a)  $\dfrac{\partial^2 u}{\partial x^2} + \dfrac{\partial^2 u}{\partial y^2} = 0$    Laplace equation;

(1b)  $\dfrac{\partial^2 u}{\partial t^2} - c^2\,\dfrac{\partial^2 u}{\partial x^2} = 0$    Wave equation;

(1c)  $\dfrac{\partial u}{\partial t} - a\,\dfrac{\partial^2 u}{\partial x^2} = 0$    Diffusion or heat conduction equation.


The applications of these equations are so varied and well known that we 
do not make specific mention of particular cases. Of course, in applied 
mathematics other partial differential equations occur; most of these 
are non-linear and not covered by a complete mathematical theory of 
existence, uniqueness, or well-posedness. 

To each of the equations in (1) we must adjoin appropriate subsidiary
relations, called boundary and/or initial conditions, which serve to complete
the formulation of a “meaningful problem.” These conditions are related
to the domain, say $D$, in which the equation (1) is to be solved. When the
problem arises from a physical application it is usually clear (to anyone
understanding the phenomenon) what these relations must be. Some
familiar examples are, for the respective equations (1a, b, and c);

(2a, i)  $u = f(x, y)$, for $(x, y)$ on the boundary of $D$,

or with $\partial/\partial n$ representing the normal derivative,

(2a, ii)  $\alpha u + \beta\,\dfrac{\partial u}{\partial n} = f(x, y)$, for $(x, y)$ on the boundary of $D$;

(2b, i)  $u(0, x) = f(x), \qquad \dfrac{\partial u(0, x)}{\partial t} = g(x), \qquad -\infty < x < \infty,$

where $D \equiv \{(t, x) \mid t \ge 0,\ -\infty < x < \infty\}$, i.e., $D$ = half plane, or

(2b, ii)  $u(0, x) = f(x), \qquad \dfrac{\partial u(0, x)}{\partial t} = g(x),$
          $u(t, a) = \alpha(t), \qquad u(t, b) = \beta(t), \qquad t > 0,$

where $D \equiv \{(t, x) \mid t \ge 0,\ a \le x \le b\}$, i.e., $D$ = half strip;

(2c, i)  $u(0, x) = f(x), \qquad -\infty < x < \infty,$

where $D$ is the half plane, or

(2c, ii)  $u(0, x) = f(x), \qquad a \le x \le b,$
          $u(t, a) = \alpha(t), \qquad u(t, b) = \beta(t),$

where $D$ is the half strip.

If the functions introduced in (2a, b, and c) satisfy appropriate smoothness
conditions, then each set of relations (i) or (ii) adjoined to the corresponding
equation in (1) yields a problem which has been termed well-posed
or properly-posed by Hadamard. This implies that each such problem
has a bounded solution, that the solution is unique, and that it depends
continuously on the data (i.e., a “small” change in $f$, $g$, $\alpha$, or $\beta$ produces a
“correspondingly small” change in the solution). There are many other
combinations of boundary and/or initial conditions which together with 
the equations in (1) (or more general equations) constitute properly posed 
problems. It is such problems for which there is a reasonably developed 
theory of difference approximations. We shall examine this theory briefly 
in Section 5, after first studying some special cases. However, as we shall 
see in Section 5, the theory serves mainly to determine whether a given 
method yields approximations of reasonable accuracy; but the theory does 
not directly suggest how to construct numerical schemes. 

0.1. Conventions of Notation 

For simplicity, let the domain $D$ have boundary $C$ and lie in the three
dimensional space of variables $(x, y, t)$. Cover this space by a net, grid,
mesh, or lattice of discrete points, with coordinates $(x_i, y_j, t_k)$ given by

$x_i = x_0 + i\,\delta x$, $y_j = y_0 + j\,\delta y$, $t_k = t_0 + k\,\delta t$; $i, j, k = 0, \pm 1, \pm 2, \ldots$.

Here, we have taken the net spacings $\delta x$, $\delta y$, and $\delta t$ to be uniform. The lattice
points may be divided into three disjoint sets: $D_\delta$, the interior net points;
$C_\delta$, the boundary net points; and the remainder which are external points.
Here we assume, again for simplicity, that $C$ is composed of sections of
coordinate surfaces. The specific rules for assigning lattice points to a
particular set will be clarified in the subsequent examples and discussion.

At the points of $D_\delta + C_\delta$ the function $u(x, y, t)$ is to be approximated
by a net function, $U(x_i, y_j, t_k)$. It is convenient to denote the components
of net functions by appropriate subscripts and/or superscripts. For
instance, we may use

$U(x_i, y_j) = U_{i,j}$, $U(x_i, y_j, t_k) = U_{i,j}^{k}$, etc.

This notation is frequently cumbersome and at times difficult (if not
unpleasant) to read. Thus, while we shall have occasion to use it, we
prefer another notation, more in keeping with the usual functional notation.
If $U$ has been defined to be a net function, then we may
write $U(x, y, t)$ and understand the argument point $(x, y, t)$ to be some
general point of the net on which $U$ is defined. Furthermore, if we simply
write $U$ then the argument is understood to be a general point $(x, y, t)$
of the appropriate net.

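We shall make frequent use of various difference quotients of net
functions (of course, in order to approximate partial derivatives). For this
purpose we introduce a subscript notation for difference quotients of net
functions:

(3a)  $U_x(x, y, t) \equiv \dfrac{1}{\delta x}\,[U(x + \delta x, y, t) - U(x, y, t)]$;

(3b)  $U_{\bar{x}}(x, y, t) \equiv \dfrac{1}{\delta x}\,[U(x, y, t) - U(x - \delta x, y, t)]$;

(3c)  $U_{\hat{x}}(x, y, t) \equiv \tfrac{1}{2}\,[U_x(x, y, t) + U_{\bar{x}}(x, y, t)]$.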


Clearly, (3a, b, and c) are just the forward, backward, and centered
difference quotients with respect to $x$. By our previous convention we might
have written the left-hand sides of (3) as just $U_x$, $U_{\bar{x}}$, and $U_{\hat{x}}$. This convenient
notation was introduced by Courant, Friedrichs, and Lewy in a
fundamental paper on difference methods for partial differential equations.
The difference quotients with respect to other discrete variables are defined
in analogy with (3), say $U_y$, $U_{\bar{t}}$, etc. It is a simple matter to verify that
these difference operators commute; i.e.,

$U_{x\bar{x}} = U_{\bar{x}x}$, $U_{x\bar{y}} = U_{\bar{y}x}$, etc.

A particularly important case is the centered second difference quotient 
which can be written as 

(4)  $U_{y\bar{y}} = U_{\bar{y}y} = \dfrac{1}{(\delta y)^2}\,[U(x, y + \delta y, t) - 2U(x, y, t) + U(x, y - \delta y, t)]$.
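
As a small illustration of definitions (3) and (4), the following Python sketch evaluates the difference quotients of a net function. It rests on stated assumptions: the net function is represented by a hypothetical callable `U(x, y)` (the time argument is dropped for brevity), and the names `forward_x`, `backward_x`, `centered_x`, and `second_y` are illustrative choices, not notation from the text.

```python
def forward_x(U, x, y, dx):
    """Forward difference quotient (3a) of the net function U at (x, y)."""
    return (U(x + dx, y) - U(x, y)) / dx


def backward_x(U, x, y, dx):
    """Backward difference quotient (3b)."""
    return (U(x, y) - U(x - dx, y)) / dx


def centered_x(U, x, y, dx):
    """Centered difference quotient (3c): mean of forward and backward."""
    return 0.5 * (forward_x(U, x, y, dx) + backward_x(U, x, y, dx))


def second_y(U, x, y, dy):
    """Centered second difference quotient (4) with respect to y."""
    return (U(x, y + dy) - 2.0 * U(x, y) + U(x, y - dy)) / dy**2


# Illustrative net function (an assumption): samples of x**3 + y**2.
U = lambda x, y: x**3 + y**2

cx = centered_x(U, 1.0, 0.0, 0.1)   # approximates d/dx (x^3) = 3 at x = 1
sy = second_y(U, 1.0, 0.5, 0.1)     # approximates d^2/dy^2 (y^2) = 2
```

The centered quotient differs from the exact derivative by a term of order $(\delta x)^2$, while the second difference of a quadratic is exact up to roundoff.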


1. LAPLACE EQUATION IN A RECTANGLE 


A standard type of problem which employs the Laplace operator or
Laplacian,

$\Delta \equiv \dfrac{\partial^2}{\partial x^2} + \dfrac{\partial^2}{\partial y^2}$,

is to determine a function, $u(x, y)$, such that

(1a)  $-\Delta u(x, y) = f(x, y)$, $(x, y) \in D$;

(1b)  $u(x, y) = g(x, y)$, $(x, y) \in C$.

Here $D$ is some domain in the $x, y$-plane and $C$ is its boundary. If the
boundary $C$ and inhomogeneous terms $f(x, y)$ and $g(x, y)$ satisfy mild
regularity conditions, it is well known that the problem (1) is well-posed.
If $f \equiv 0$ this is called a Dirichlet problem for Laplace’s equation while
in its present form (1a) is called the Poisson equation. For simplicity of
presentation, we take $D$ to be a rectangle

(2a)  $D \equiv \{(x, y) \mid 0 < x < a,\ 0 < y < b\}$,

whose boundary $C$ is composed of four line segments

(2b)  $C \equiv \{(x, y) \mid x = 0, a,\ 0 \le y \le b;\ y = 0, b,\ 0 \le x \le a\}$.

To “solve” this problem numerically we introduce the net spacings
$\delta x = a/(J + 1)$, $\delta y = b/(K + 1)$, and the uniformly spaced net points

$x_j = j\,\delta x$, $y_k = k\,\delta y$; $j, k = 0, \pm 1, \pm 2, \ldots$.

Those net points interior to $D$ we call $D_\delta$, i.e.,

(3a)  $D_\delta \equiv \{(x_j, y_k) \mid 1 \le j \le J;\ 1 \le k \le K\}$.

The net points on $C$, with the exception of the four corners of $C$, we call
$C_\delta$, i.e.,

(3b)  $C_\delta \equiv \{(x_j, y_k) \mid j = (0, J + 1),\ 1 \le k \le K;\ k = (0, K + 1),\ 1 \le j \le J\}$.


At the net points $D_\delta + C_\delta$ we seek quantities $U(x_j, y_k)$ which are to
approximate the solution $u(x_j, y_k)$ of (1). The net function will, of course,
be defined as the solution of a system of difference equations that replaces
the partial differential equation (1a) and boundary conditions (1b) on the
net.

An obvious approximation to the Laplacian is obtained by replacing each
second derivative by a centered second difference quotient. Thus at each
point $(x, y) \in D_\delta$ we define

(4a)  $\Delta_\delta U(x, y) \equiv U_{x\bar{x}}(x, y) + U_{y\bar{y}}(x, y)$.

In the subscript notation, we could also write for each $(x_j, y_k) \in D_\delta$

(4b)  $\Delta_\delta U_{j,k} \equiv \dfrac{U_{j+1,k} - 2U_{j,k} + U_{j-1,k}}{(\delta x)^2} + \dfrac{U_{j,k+1} - 2U_{j,k} + U_{j,k-1}}{(\delta y)^2}$.

It is frequently convenient with either of these notations to indicate the
net points involved in the definition of $\Delta_\delta U$ by means of a diagram as in
Figure 1. The set of points marked with crosses is called the star or stencil


associated with the difference operator $\Delta_\delta$.

With the notation (4a), we write the difference problem as

(5a)  $-\Delta_\delta U(x, y) = f(x, y)$, $(x, y) \in D_\delta$;

(5b)  $U(x, y) = g(x, y)$, $(x, y) \in C_\delta$.

From (3), (4), and (5) we find that the values $U$ at the $JK + 2(J + K)$ net
points in $D_\delta + C_\delta$ satisfy $JK + 2(J + K)$ linear equations. Hence, we
may hope to solve (5) for the unknowns $U(x, y)$ in $D_\delta + C_\delta$. The $2(J + K)$
values of $U$ on $C_\delta$ are specified in (5b) and so the $JK$ equations of (5a)
must determine the remaining $JK$ unknowns. We shall first show that this
system has a unique solution and then we will estimate the error in the
approximation. Finally, we shall consider practical methods for solving
the linear system (5).
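
The five-point difference operator that enters (5a) can be sketched in a few lines of Python. This is only an illustration under assumptions: the net function is stored as a nested list `U[j][k]` holding the value at $(x_j, y_k)$ for $0 \le j \le J+1$, $0 \le k \le K+1$, and the helper names `make_net` and `laplace_delta` are hypothetical.

```python
def make_net(u, a, b, J, K):
    """Sample u on the net x_j = j*dx, y_k = k*dy, 0 <= j <= J+1, 0 <= k <= K+1."""
    dx, dy = a / (J + 1), b / (K + 1)
    U = [[u(j * dx, k * dy) for k in range(K + 2)] for j in range(J + 2)]
    return U, dx, dy


def laplace_delta(U, j, k, dx, dy):
    """Five-point difference Laplacian at the interior net point (x_j, y_k)."""
    return ((U[j + 1][k] - 2.0 * U[j][k] + U[j - 1][k]) / dx**2
            + (U[j][k + 1] - 2.0 * U[j][k] + U[j][k - 1]) / dy**2)


# Centered second differences are exact for quadratics, so for
# u = x^2 + 3*y^2 every interior value should equal the true Laplacian, 8.
U, dx, dy = make_net(lambda x, y: x**2 + 3.0 * y**2, a=1.0, b=2.0, J=4, K=7)
values = [laplace_delta(U, j, k, dx, dy)
          for j in range(1, 5) for k in range(1, 8)]
```

Only the point itself and its four nearest neighbors are touched, which is exactly the star of the operator.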

To demonstrate that the difference equations have a unique solution
we shall prove that the corresponding homogeneous system has only the
trivial solution. For this purpose and for the error estimates to be obtained,
we first prove a maximum principle for the operator $\Delta_\delta$.

THEOREM 1.

(a) If $V(x, y)$ is a net function defined on $D_\delta + C_\delta$ and satisfies

$\Delta_\delta V(x, y) \ge 0$ for all $(x, y) \in D_\delta$,

then

$\max_{D_\delta} V(x, y) \le \max_{C_\delta} V(x, y)$.

(b) Alternatively, if $V$ satisfies

$\Delta_\delta V(x, y) \le 0$ for all $(x, y) \in D_\delta$,

then

$\min_{D_\delta} V(x, y) \ge \min_{C_\delta} V(x, y)$.

Proof. We prove part (a) by contradiction. Assume that at some point
$P_0 \equiv (x^*, y^*)$ of $D_\delta$, we have $V(P_0) = M$ where

$M \ge V(P)$ for all $P \in D_\delta$ and $M > V(P)$ for all $P \in C_\delta$.

Let us introduce the notation $P_1 \equiv (x^* + \delta x, y^*)$, $P_2 \equiv (x^* - \delta x, y^*)$,
$P_3 \equiv (x^*, y^* + \delta y)$, $P_4 \equiv (x^*, y^* - \delta y)$ and then use (4) to write

$\Delta_\delta V(P_0) \equiv \theta_x[V(P_1) + V(P_2)] + \theta_y[V(P_3) + V(P_4)] - 2(\theta_x + \theta_y)V(P_0)$,

where $\theta_x \equiv 1/(\delta x)^2$ and $\theta_y \equiv 1/(\delta y)^2$. However, by hypothesis $\Delta_\delta V(P_0) \ge 0$,
so we have

$M = V(P_0) \le \dfrac{1}{\theta_x + \theta_y}\left[\theta_x\,\dfrac{V(P_1) + V(P_2)}{2} + \theta_y\,\dfrac{V(P_3) + V(P_4)}{2}\right]$.

But $M \ge V$ implies that $V(P_\nu) = M$ for $\nu = 1, 2, 3, 4$. We now repeat
this argument for each interior point $P_\nu$ instead of the point $P_0$. By
repetition, each point of $D_\delta$ and $C_\delta$ appears as one of the $P_\nu$ for some
corresponding $P_0$. Thus, we conclude that

$V(P) = M$ for all $P$ in $D_\delta + C_\delta$,

which contradicts the assumption that $V < M$ on $C_\delta$. Part (a) of the
theorem follows.†

To prove part (b), we could repeat an argument similar to the above.
However, it is simpler to recall that

$\max\,[-V(x, y)] = -\min V(x, y)$; $\Delta_\delta(-V) = -\Delta_\delta(V)$.

Hence, if $V$ satisfies the hypothesis of part (b), then $-V$ satisfies the
hypothesis of part (a). But the conclusion of part (a) for $-V$ is identical
to the conclusion of part (b) for $V$.† ■

Let us now consider the homogeneous system corresponding to (5);
i.e., $f \equiv g \equiv 0$. From Theorem 1, it follows that the max and min of the
solution of this homogeneous system vanish; hence, the only solution is the
trivial one. Thus it follows by the alternative principle for linear systems
that (5) has a unique solution for arbitrary $f(x, y)$ and $g(x, y)$.

A bound for the solution of the difference equation (5) can also be
obtained by an appropriate application of the maximum principle. The
result, called an a priori estimate, may be stated as

THEOREM 2. Let $V(x, y)$ be any net function defined on the sets $D_\delta$ and
$C_\delta$ defined by (3). Then

(6)  $\max_{D_\delta} |V| \le \max_{C_\delta} |V| + \dfrac{a^2}{2}\,\max_{D_\delta} |\Delta_\delta V|$.

† We have in fact proved more; namely, that if the maximum, in case (a), or the
minimum, in case (b), of $V(x, y)$ occurs in $D_\delta$, then $V(x, y)$ is constant on $D_\delta + C_\delta$.

Proof. We introduce the function

$\phi(x, y) \equiv \tfrac{1}{2}x^2$

and observe that for all $(x, y) \in D_\delta + C_\delta$,

$0 \le \phi(x, y) \le \dfrac{a^2}{2}$; $\Delta_\delta \phi(x, y) = 1$.

Now define the two net functions $V_+(x, y)$ and $V_-(x, y)$ by

$V_\pm(x, y) \equiv \pm V(x, y) + N\phi(x, y)$,

where

$N \equiv \max_{D_\delta} |\Delta_\delta V|$.

Clearly for all $(x, y) \in D_\delta$, it follows that

$\Delta_\delta V_\pm(x, y) = \pm\Delta_\delta V(x, y) + N \ge 0$.

Thus we may apply the maximum principle, part (a) of Theorem 1, to
each of $V_\pm(x, y)$ to obtain for all $(x, y) \in D_\delta$,

$V_\pm(x, y) \le \max_{C_\delta} V_\pm(x, y) = \max_{C_\delta}\,[\pm V(x, y) + N\phi] \le \max_{C_\delta}\,[\pm V(x, y)] + N\,\dfrac{a^2}{2}$.

But from the definition of $V_\pm$ and the fact that $\phi \ge 0$,

$\pm V(x, y) \le V_\pm(x, y)$.

Hence,

$\pm V(x, y) \le \max_{C_\delta}\,[\pm V(x, y)] + N\,\dfrac{a^2}{2} \le \max_{C_\delta} |V| + \dfrac{a^2}{2}\,N$.

Since the right-hand side in the final inequality is independent of $(x, y)$
in $D_\delta$ the theorem follows. ■

Note that we could readily replace $a^2/2$ in (6) by $b^2/2$ since the function
$\psi(x, y) = y^2/2$ can be used in place of $\phi(x, y)$ in the proof of the theorem.
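
The a priori estimate (6) holds for every net function, so it can be spot-checked numerically. The sketch below is a hedged illustration, not part of the proof: it fills a net with pseudo-random values (an arbitrary choice made here), evaluates both sides of (6), and returns them for comparison. The names `laplace_delta` and `check_a_priori` are hypothetical.

```python
import random


def laplace_delta(V, j, k, dx, dy):
    """Five-point difference Laplacian at the interior net point (x_j, y_k)."""
    return ((V[j + 1][k] - 2.0 * V[j][k] + V[j - 1][k]) / dx**2
            + (V[j][k + 1] - 2.0 * V[j][k] + V[j][k - 1]) / dy**2)


def check_a_priori(a, b, J, K, seed=0):
    """Return (left side, right side) of estimate (6) for a random net function."""
    rng = random.Random(seed)
    dx, dy = a / (J + 1), b / (K + 1)
    V = [[rng.uniform(-1.0, 1.0) for _ in range(K + 2)] for _ in range(J + 2)]
    interior = [(j, k) for j in range(1, J + 1) for k in range(1, K + 1)]
    # Boundary net points, corners excluded, as in definition (3b).
    boundary = ([(j, k) for j in (0, J + 1) for k in range(1, K + 1)]
                + [(j, k) for k in (0, K + 1) for j in range(1, J + 1)])
    lhs = max(abs(V[j][k]) for j, k in interior)
    rhs = (max(abs(V[j][k]) for j, k in boundary)
           + 0.5 * a**2 * max(abs(laplace_delta(V, j, k, dx, dy))
                              for j, k in interior))
    return lhs, rhs


lhs, rhs = check_a_priori(a=1.0, b=2.0, J=6, K=9)
```

For rough data such as this the right side is far larger than the left, since the difference Laplacian scales like $1/(\delta x)^2$; the estimate becomes informative for smooth net functions.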

It is now a simple matter to estimate the error $U - u$. We introduce the
local truncation error, $\tau\{\Phi\}$, for the difference operator $\Delta_\delta$ on $D_\delta$ by

(7)  $\tau\{\Phi(x, y)\} \equiv \Delta_\delta \Phi(x, y) - \Delta \Phi(x, y)$, $(x, y)$ in $D_\delta$,

where $\Phi(x, y)$ is any sufficiently smooth function defined on $D$. Now if
$u(x, y)$ is the solution of the boundary value problem (1) we have from
(1a) at the points of $D_\delta$

$-\Delta_\delta u(x, y) = f(x, y) - \tau\{u(x, y)\}$.

Subtracting this from (5a) at each point of $D_\delta$ yields

(8a)  $-\Delta_\delta [U(x, y) - u(x, y)] = \tau\{u(x, y)\}$, $(x, y)$ in $D_\delta$.

Also from (1b) and (5b) we obtain

(8b)  $U(x, y) - u(x, y) = 0$, $(x, y)$ in $C_\delta$.

Now apply Theorem 2 to the net function $U(x, y) - u(x, y)$ and we get
by (8)

$\max_{D_\delta} |U(x, y) - u(x, y)| \le \dfrac{a^2}{2}\,\max_{D_\delta} |\tau\{u\}|$.

Upon introducing the maximum norm defined for any net function
$W(x, y)$ by $\|W\| \equiv \max_{D_\delta} |W|$, we have the

COROLLARY. With $u$, $U$, and $\tau$ defined respectively by (1), (5), and (7),
we have

(9)  $\|U(x, y) - u(x, y)\| \le \dfrac{a^2}{2}\,\|\tau\{u\}\|$. ■

Note that the error bound is proportional to the truncation error!

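It is easy to estimate $\|\tau\|$. If the solution $u(x, y)$ of (1) has continuous
and bounded fourth order partial derivatives in $D$, then

(10)  $u(x \pm \delta x, y) = u(x, y) \pm \delta x\,\dfrac{\partial u(x, y)}{\partial x} + \dfrac{(\delta x)^2}{2!}\,\dfrac{\partial^2 u(x, y)}{\partial x^2} \pm \dfrac{(\delta x)^3}{3!}\,\dfrac{\partial^3 u(x, y)}{\partial x^3} + \dfrac{(\delta x)^4}{4!}\,\dfrac{\partial^4 u(x + \theta_\pm\,\delta x, y)}{\partial x^4}$, $|\theta_\pm| < 1$.

Thus we find, as in Chapter 6, that

$u_{x\bar{x}}(x, y) = \dfrac{\partial^2 u(x, y)}{\partial x^2} + \dfrac{(\delta x)^2}{12}\,\dfrac{\partial^4 u(x + \theta\,\delta x, y)}{\partial x^4}$, $|\theta| < 1$,

with a similar result for the $y$ derivatives. Hence,

$\tau\{u(x, y)\} = \Delta_\delta u(x, y) - \Delta u(x, y) = \dfrac{1}{12}\left[(\delta x)^2\,\dfrac{\partial^4 u(x + \theta\,\delta x, y)}{\partial x^4} + (\delta y)^2\,\dfrac{\partial^4 u(x, y + \theta'\,\delta y)}{\partial y^4}\right]$, $|\theta'| < 1$.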

If we denote the bounds of the respective fourth order derivatives by
$M_x^{(4)}$ and $M_y^{(4)}$, then

(11a)  $\|\tau\{u\}\| \le \dfrac{1}{12}\left((\delta x)^2 M_x^{(4)} + (\delta y)^2 M_y^{(4)}\right)$.

If $u(x, y)$ has only continuous third order derivatives, we terminate the
expansions in (10) one term earlier and get

$u_{x\bar{x}}(x, y) - \dfrac{\partial^2 u(x, y)}{\partial x^2} = \dfrac{\delta x}{3!}\left[\dfrac{\partial^3 u(x + \theta_+\,\delta x, y)}{\partial x^3} - \dfrac{\partial^3 u(x + \theta_-\,\delta x, y)}{\partial x^3}\right]$.

If the moduli of continuity in $D$ of the third derivatives $\partial^3 u/\partial x^3$ and
$\partial^3 u/\partial y^3$ are denoted by $\omega_x^{(3)}(\delta)$ and $\omega_y^{(3)}(\delta)$, respectively, we have

(11b)  $\|\tau\{u\}\| \le \tfrac{1}{6}\left[\delta x\,\omega_x^{(3)}(2\,\delta x) + \delta y\,\omega_y^{(3)}(2\,\delta y)\right]$.

Clearly by these procedures, we find that if $u(x, y)$ has only continuous
second derivatives with moduli of continuity $\omega_x^{(2)}(\delta)$ and $\omega_y^{(2)}(\delta)$, then

(11c)  $\|\tau\{u\}\| \le \omega_x^{(2)}(\delta x) + \omega_y^{(2)}(\delta y)$.

With the aid of any of the estimates (11a, b, or c) that may be appropriate,
the corollary establishes convergence of the approximate solution to the
exact solution as $\delta x \to 0$ and $\delta y \to 0$ in any manner. We see that the
convergence rate is generally faster for “smoother” solutions $u(x, y)$.
For solutions which have more than four continuous derivatives, we cannot
deduce better truncation error estimates than that given by (11a).
It is possible to construct more accurate difference approximations to
the Laplacian, which then have solutions $U$ of greater accuracy than
$O[(\delta x)^2 + (\delta y)^2]$. However, there is no general way of constructing convergence
proofs for similar schemes of arbitrarily high order truncation error. In
fact, it is unlikely that such schemes, which are of maximum order of
accuracy, converge in general.
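
The second order convergence implied by (9) and (11a) can be observed on a small test problem. The sketch below makes several assumptions that are not in the text: the domain is the unit square, the exact solution is taken to be $u = \sin(\pi x)\sin(\pi y)$ (so that $f = 2\pi^2 u$ and $g = 0$), and the difference equations are solved by plain Gauss-Seidel sweeps, chosen here only for brevity. The function names are hypothetical.

```python
import math


def solve_poisson(J, sweeps=500):
    """Solve the five-point difference problem on the unit square with a
    J-by-J interior net, zero boundary data, and f = 2*pi^2*sin(pi x)sin(pi y).
    Gauss-Seidel sweeps are an illustrative solver choice."""
    h = 1.0 / (J + 1)
    U = [[0.0] * (J + 2) for _ in range(J + 2)]
    for _ in range(sweeps):
        for j in range(1, J + 1):
            for k in range(1, J + 1):
                f = 2.0 * math.pi**2 * math.sin(math.pi * j * h) * math.sin(math.pi * k * h)
                # Five-point equation solved for the center value (dx = dy = h).
                U[j][k] = 0.25 * (U[j + 1][k] + U[j - 1][k]
                                  + U[j][k + 1] + U[j][k - 1] + h * h * f)
    return U, h


def max_error(J):
    """Maximum norm of U - u over the interior net points."""
    U, h = solve_poisson(J)
    return max(abs(U[j][k] - math.sin(math.pi * j * h) * math.sin(math.pi * k * h))
               for j in range(1, J + 1) for k in range(1, J + 1))


e_coarse = max_error(3)   # h = 1/4
e_fine = max_error(7)     # h = 1/8
ratio = e_coarse / e_fine  # expected to be close to 4 for an O(h^2) method

# Right side of (9) with (11a) for h = 1/4, using a = 1 and the
# fourth-derivative bounds M_x = M_y = pi^4 of the assumed solution.
bound_coarse = 0.5 * (2.0 * 0.25**2 * math.pi**4) / 12.0
```

Halving the mesh width reduces the error by roughly a factor of four, and the observed error stays well below the a priori bound, which is not sharp.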

The effects of roundoff can also be estimated by means of Theorem 2.
Let the numbers actually computed be denoted by $\tilde{U}(x, y)$. Then we can
write

(12a)  $-\Delta_\delta \tilde{U}(x, y) = f(x, y) + \dfrac{\rho(x, y)}{\delta x\,\delta y}$, $(x, y) \in D_\delta$;

(12b)  $\tilde{U}(x, y) = g(x, y) + \rho'(x, y)$, $(x, y) \in C_\delta$.

Here $\rho'(x, y)$ is the roundoff error in approximating the boundary data.
After noting that the coefficients in $\Delta_\delta$ are proportional to $1/(\delta x)^2$ and
$1/(\delta y)^2$, we have defined the roundoff errors, $\rho(x, y)$, in the computations
(12a) to be proportional to a similar factor. This corresponds to the fact
that the actual computations are done with the form of (4) which results
after multiplication by the factor $\delta x\,\delta y$. We now obtain from (1), (7)
and (12)

$-\Delta_\delta[\tilde{U}(x, y) - u(x, y)] = \tau\{u(x, y)\} + \dfrac{\rho(x, y)}{\delta x\,\delta y}$, $(x, y) \in D_\delta$;

$\tilde{U}(x, y) - u(x, y) = \rho'(x, y)$, $(x, y) \in C_\delta$.

Thus for the net function $\tilde{U}(x, y) - u(x, y)$, Theorem 2 implies

THEOREM 3. With $u$, $\tilde{U}$, and $\tau$ defined by (1), (12), and (7) respectively, we
have

(13)  $\|\tilde{U}(x, y) - u(x, y)\| \le \|\rho'\| + \dfrac{a^2}{2}\left[\|\tau\{u\}\| + \dfrac{1}{\delta x\,\delta y}\,\|\rho\|\right]$.

Here

$\|\rho'\| \equiv \max_{C_\delta} |\rho'(x, y)|$ and $\|\rho\| \equiv \max_{D_\delta} |\rho(x, y)|$. ■

Thus we find that the boundary roundoff error and the interior roundoff
error have quite different effects on the accuracy as $\delta x$ and $\delta y \to 0$.
In fact, to be consistent with the truncation error, the interior roundoff
error, $\rho$, should be of the same order as $\delta x\,\delta y\,\tau\{u\}$ and the boundary roundoff
error, $\rho'$, should be of the same order as $\tau$ when $\delta x$ and $\delta y \to 0$. This
result for $\rho$ is analogous to that in (7.34) of Chapter 8 where simple difference
approximations of an ordinary boundary value problem were
considered.

The maximum principle and its applications given here can be generalized 
in various ways (see Problems 1-4). Extensions to rectangular domains in 
higher dimensions are straightforward, and non-rectangular domains may 
also be treated (with suitable modifications of the difference equations 
near the boundary surface). 


1.1. Matrix Formulation 

The system of linear equations (5) can be written in matrix-vector notation
in various ways. For this purpose, we use the subscript notation
for any net function $V(x, y)$ defined on $D_\delta + C_\delta$:

$V(x_j, y_k) = V_{jk}$; $0 \le j \le J + 1$, $0 \le k \le K + 1$.

From the values of such a net function, construct the $J$-dimensional
vector

(14a)  $\mathbf{V}_k \equiv (V_{1k}, V_{2k}, \ldots, V_{Jk})^T$, $k = 0, 1, 2, \ldots, K + 1$.


Each vector $\mathbf{V}_k$ consists of the elements of the net function $V(x_j, y_k)$
on the coordinate line segment $y = y_k$, $x_1 \le x \le x_J$. (We note that the
elements on the line segments $y_1 \le y \le y_K$, $x = x_0$, and $x = x_{J+1}$ are
not included.) We also introduce the $J$th order square matrices

(14b)  $I_J \equiv \begin{pmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{pmatrix} \equiv (\delta_{ij})$, $L_J \equiv \begin{pmatrix} 0 & & & \\ 1 & 0 & & \\ & \ddots & \ddots & \\ & & 1 & 0 \end{pmatrix} \equiv (a_{ij})$, $a_{ij} = \delta_{i,j+1}$,

and the quantities

(14c)  $\delta^2 \equiv \dfrac{(\delta x)^2 (\delta y)^2}{2[(\delta x)^2 + (\delta y)^2]}$, $\theta_x \equiv \dfrac{(\delta y)^2}{2[(\delta x)^2 + (\delta y)^2]}$, $\theta_y \equiv \dfrac{(\delta x)^2}{2[(\delta x)^2 + (\delta y)^2]}$.

Upon multiplying (5a) by $\delta^2$, we can write the result for $(x, y) = (x_j, y_k)$
in subscript notation as

(15)  $U_{jk} - \theta_x(U_{j-1,k} + U_{j+1,k}) - \theta_y(U_{j,k-1} + U_{j,k+1}) = \delta^2 f_{jk}$, $1 \le j \le J$, $1 \le k \le K$.
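Or with the vector and matrix notation of (14) this system becomes

(16a)
$[I_J - \theta_x(L_J + L_J^T)]\mathbf{U}_1 - \theta_y \mathbf{U}_2 = \delta^2 \mathbf{F}_1$,
$-\theta_y \mathbf{U}_{k-1} + [I_J - \theta_x(L_J + L_J^T)]\mathbf{U}_k - \theta_y \mathbf{U}_{k+1} = \delta^2 \mathbf{F}_k$, $2 \le k \le K - 1$;
$-\theta_y \mathbf{U}_{K-1} + [I_J - \theta_x(L_J + L_J^T)]\mathbf{U}_K = \delta^2 \mathbf{F}_K$.

Here we have introduced

(16b)
$\mathbf{F}_1 \equiv \mathbf{f}_1 + \dfrac{1}{(\delta x)^2}\,\mathbf{W}_1 + \dfrac{1}{(\delta y)^2}\,\mathbf{U}_0$,
$\mathbf{F}_k \equiv \mathbf{f}_k + \dfrac{1}{(\delta x)^2}\,\mathbf{W}_k$, $2 \le k \le K - 1$,
$\mathbf{F}_K \equiv \mathbf{f}_K + \dfrac{1}{(\delta x)^2}\,\mathbf{W}_K + \dfrac{1}{(\delta y)^2}\,\mathbf{U}_{K+1}$,

where $\mathbf{f}_k \equiv (f_{jk})$ and

$\mathbf{W}_k \equiv (w_{ik}) \equiv (U_{0k}, 0, \ldots, 0, U_{J+1,k})^T$,

i.e., $w_{ik} = 0$ for $2 \le i \le J - 1$; $w_{1k} = U_{0k}$; $w_{Jk} = U_{J+1,k}$. Of course,
all of the $U_{jk}$ which enter into (16b) are known quantities given in (5b).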

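Further simplification is obtained by introducing $JK$-dimensional
vectors or $K$-dimensional compound vectors (i.e., vectors whose components
are $J$-dimensional vectors)

(17a)  $\mathbf{U} \equiv \begin{pmatrix} \mathbf{U}_1 \\ \mathbf{U}_2 \\ \vdots \\ \mathbf{U}_K \end{pmatrix}$, $\mathbf{F} \equiv \begin{pmatrix} \mathbf{F}_1 \\ \mathbf{F}_2 \\ \vdots \\ \mathbf{F}_K \end{pmatrix}$,

and the square matrices of order $JK$

(17b)  $I \equiv (\delta_{ij})$, $L \equiv \begin{pmatrix} L_J & & \\ & \ddots & \\ & & L_J \end{pmatrix}$, $B \equiv \begin{pmatrix} 0 & & & \\ I_J & 0 & & \\ & \ddots & \ddots & \\ & & I_J & 0 \end{pmatrix}$,

$H \equiv \theta_x(L + L^T)$, $V \equiv \theta_y(B + B^T)$, $A \equiv I - H - V$.

Now the system (16), or equivalently (5), can be written as

(18)  $A\mathbf{U} = \delta^2 \mathbf{F}$.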

The vectors in (17a) associate a component with each net point $(x, y)$ of
$D_\delta$. In the indicated vectors of dimension $N = JK$, the $r$th component
is the value of the net function at the point $(x_j, y_k)$ such that
$r = j + (k - 1)J$. If the assignment of integers, $r$, to net points of $D_\delta$ is done in
some other order, then the vectors and matrices are changed by some
permutation. (Another ordering of interest would be to list the elements on
lines $x$ = constant of $D_\delta$.) The previous proof that the system (5) has a
unique solution now implies that $A$ is non-singular. We shall prove this
fact directly by showing that the eigenvalues of $A$ are positive.
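
To make the ordering $r = j + (k - 1)J$ concrete, the following Python sketch assembles the matrix $A = I - H - V$ entry by entry as a dense nested list. It is an illustration under assumptions: the function name `assemble_A` is hypothetical, indices are shifted to start at zero, and dense storage is used only for readability (a practical code would exploit the zeros).

```python
def assemble_A(J, K, dx, dy):
    """Dense JK-by-JK matrix A = I - H - V for the ordering r = j + (k-1)J
    (stored zero-based), with theta_x and theta_y as in (14c)."""
    s = 2.0 * (dx**2 + dy**2)
    theta_x, theta_y = dy**2 / s, dx**2 / s
    N = J * K
    A = [[0.0] * N for _ in range(N)]
    for k in range(1, K + 1):
        for j in range(1, J + 1):
            r = (j - 1) + (k - 1) * J
            A[r][r] = 1.0
            # Neighbors on the same line y = y_k (the matrix H).
            if j > 1:
                A[r][r - 1] = -theta_x
            if j < J:
                A[r][r + 1] = -theta_x
            # Neighbors on the adjacent lines y = y_{k-1}, y_{k+1} (the matrix V).
            if k > 1:
                A[r][r - J] = -theta_y
            if k < K:
                A[r][r + J] = -theta_y
    return A


# Small example: a = b = 1 with J = 3, K = 2, so dx = 1/4 and dy = 1/3.
A = assemble_A(3, 2, 0.25, 1.0 / 3.0)
```

The tests `j > 1` and `j < J` prevent the last point of one net line from being coupled to the first point of the next, which is why the off-diagonal band of $H$ has gaps at the block boundaries.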

Let us consider again the problem of obtaining error estimates for the
approximate solution. Multiply (8a) by $\delta^2$ and employ the present notation
to obtain in place of the system (8)

(19a)  $A(\mathbf{U} - \mathbf{u}) = \delta^2 \boldsymbol{\tau}$.

Here $\mathbf{U}$ is as before, $\mathbf{u}$ is the vector of the exact solution on $D_\delta$, and $\boldsymbol{\tau}$
is the vector of local truncation errors $\tau\{u(x, y)\}$ on $D_\delta$ with no adjustments
now required as in (16b) since $U - u = 0$ on $C_\delta$. Then as $A$ is
non-singular, we have

(19b)  $\mathbf{U} - \mathbf{u} = \delta^2 A^{-1} \boldsymbol{\tau}$,

which is an exact representation for the error. By using any vector norm
and the corresponding natural matrix norm, we have from (19b),

(20)  $\|\mathbf{U} - \mathbf{u}\| \le \delta^2\,\|A^{-1}\| \cdot \|\boldsymbol{\tau}\|$.

We note from (17b), that $A = A^T$ and thus $A^{-1}$ is symmetric. If we
use the Euclidean norm in (20), i.e., for any vector $\mathbf{v}$,

$\|\mathbf{v}\|_2 \equiv \left(\sum_i v_i^2\right)^{1/2}$,

then

$\|A^{-1}\|_2 = \max_{1 \le \nu \le JK} (1/|\lambda_\nu|) = 1/\min_{1 \le \nu \le JK} |\lambda_\nu|$,

where the $\lambda_\nu$ are the eigenvalues of $A$. The eigenvalues of $A$ satisfy

(21)  $A\mathbf{W} = \lambda \mathbf{W}$.

However, we see that this is equivalent to the finite difference eigenvalue
problem

(22a)  $-\Delta_\delta W(x, y) = \dfrac{\lambda}{\delta^2}\,W(x, y)$, $(x, y)$ in $D_\delta$,

(22b)  $W(x, y) = 0$, $(x, y)$ in $C_\delta$,

since multiplication of (22a) by $\delta^2$ yields (21).

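We determine the eigenvalues of problem (21) by using the technique
called separation of variables for (22). Let us try to find a solution of the
form $W(x, y) = \phi(x)\psi(y)$ of (22a), i.e.,

$-\Delta_\delta\,\phi(x)\psi(y) = -\phi_{x\bar{x}}(x)\psi(y) - \phi(x)\psi_{y\bar{y}}(y) = \dfrac{\lambda}{\delta^2}\,\phi(x)\psi(y)$.

Now divide by $W(x, y)$ to get

$-\dfrac{\phi_{x\bar{x}}(x)}{\phi(x)} - \dfrac{\psi_{y\bar{y}}(y)}{\psi(y)} = \dfrac{\lambda}{\delta^2}$, $(x, y)$ in $D_\delta$.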





But the only way that the sum of a function of $x$ and a function of $y$ can
be constant is for each function to be a constant. Hence we may write
$\lambda = \xi + \eta$ and have the two sets of equations

(23a)  $-\phi_{x\bar{x}}(x) = \dfrac{\xi}{\delta^2}\,\phi(x)$,

(23b)  $-\psi_{y\bar{y}}(y) = \dfrac{\eta}{\delta^2}\,\psi(y)$,

$(x, y)$ in $D_\delta$.

If $\xi$ and $\eta$ are known, (23) would be ordinary difference equations of second
order with constant coefficients. We solve them as we did the difference
equations in Section 4 of Chapter 8. Thus, let us use the form $\phi(x) = \alpha^x$
in (23a) to get by using (14c),

$\left[\alpha^{\delta x} - \left(2 - \dfrac{\xi}{\theta_x}\right) + \alpha^{-\delta x}\right]\alpha^x = 0$, $\delta x \le x \le a - \delta x$.

If we set $\omega \equiv \alpha^{\delta x}$, then these equations are satisfied provided

$\omega^2 - \left(2 - \dfrac{\xi}{\theta_x}\right)\omega + 1 = 0$.

Furthermore, it is clear that $\phi(x) = \alpha^{-x}$ yields the same condition, and
hence the general solution of (23a) is of the form

$\phi(x_j) = c\,\alpha^{x_j} + d\,\alpha^{-x_j} = c\,\omega^j + d\,\omega^{-j}$.

To satisfy the boundary conditions (22b), we have $\phi(x)\psi(y) = 0$ for
$(x, y)$ in $C_\delta$. This implies that

(24)  $\phi(0) = \phi(x_0) = 0$; $\phi(a) = \phi(x_{J+1}) = 0$.

From the condition (24) at $j = 0$, we have $c = -d$; hence at $j = J + 1$,

$\omega^{2(J+1)} = 1$.

The $2(J + 1)$ roots of this equation are the roots of unity

$\omega_p = e^{i\pi p/(J+1)}$, $p = 1, 2, \ldots, 2(J + 1)$.

However, if we replace $\omega$ by $\omega^{-1}$, the solution $\phi(x)$ of the difference equation
becomes $-\phi(x)$. Hence, we need consider only the first $J + 1$ such
roots. But the $(J + 1)$st root is $\omega = -1$ which leads to the trivial solution,
$\phi \equiv 0$. Thus we have found $J$ non-trivial solutions of (23a) which satisfy
(24) and they are

(25a)  $\phi_p(x_j) = c(\omega_p^{\,j} - \omega_p^{-j}) = \sin\left(j\,\dfrac{p\pi}{J + 1}\right)$,

(25b)  $\xi_p = 2\theta_x\left(1 - \dfrac{\omega_p + \omega_p^{-1}}{2}\right) = 4\theta_x \sin^2\left(\dfrac{p\pi}{2(J + 1)}\right)$,

$p = 1, 2, \ldots, J$.

Here we have chosen the arbitrary normalization constant of $\phi(x)$ to be
$c = -i/2$. In an exactly analogous manner, we find $K$ non-trivial solutions
of (23b) which satisfy $\psi(0) = \psi(b) = 0$;

(26a)  $\psi_q(y_k) = \sin\left(k\,\dfrac{q\pi}{K + 1}\right)$,

(26b)  $\eta_q = 4\theta_y \sin^2\left(\dfrac{q\pi}{2(K + 1)}\right)$,

$q = 1, 2, \ldots, K$.

By combining these results, we find the solutions of the eigenvalue
problem (22)

(27)  $W_{p,q}(x, y) = \phi_p(x)\psi_q(y)$, $\lambda_{p,q} = \xi_p + \eta_q$; $1 \le p \le J$, $1 \le q \le K$.

We have thus found $JK$ different eigenfunctions $W_{p,q}(x, y)$, with corresponding
eigenvalues $\lambda_{p,q}$ (which may not all be distinct). In the vector
representation of the net functions $W_{p,q}(x, y)$, we have $JK$ distinct eigenvectors,
$\mathbf{W}_{p,q}$, of the matrix $A$ in (21). [In fact, it can be shown that the
$JK$ eigenvectors in (27) are orthogonal.] Hence, all of the eigenvalues of
$A$ are in the set $\lambda_{p,q}$. We observe that all eigenvalues of $A$ are positive and
$A$ is not only non-singular, but is also positive definite.
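
The eigenpairs found by separation of variables can be verified directly, without forming the matrix. The sketch below is a hedged check, assuming the sine eigenfunctions and the eigenvalues $\lambda_{p,q} = 4\theta_x \sin^2(p\pi/2(J+1)) + 4\theta_y \sin^2(q\pi/2(K+1))$ obtained above; it applies the left side of (15) to $W_{p,q}$ and measures how far the result is from $\lambda_{p,q} W_{p,q}$. The function name `eigen_residual` is hypothetical.

```python
import math


def eigen_residual(J, K, dx, dy, p, q):
    """Return (lambda_pq, max residual of A W = lambda W) for the sine
    eigenfunction W_pq, using the five-point form of A from (15)."""
    s = 2.0 * (dx**2 + dy**2)
    theta_x, theta_y = dy**2 / s, dx**2 / s
    lam = (4.0 * theta_x * math.sin(p * math.pi / (2 * (J + 1)))**2
           + 4.0 * theta_y * math.sin(q * math.pi / (2 * (K + 1)))**2)

    def W(j, k):
        # Vanishes (up to roundoff) for j = 0, J+1 and k = 0, K+1.
        return (math.sin(j * p * math.pi / (J + 1))
                * math.sin(k * q * math.pi / (K + 1)))

    res = 0.0
    for j in range(1, J + 1):
        for k in range(1, K + 1):
            lhs = (W(j, k) - theta_x * (W(j - 1, k) + W(j + 1, k))
                   - theta_y * (W(j, k - 1) + W(j, k + 1)))
            res = max(res, abs(lhs - lam * W(j, k)))
    return lam, res


# Example net: a = 1, b = 2 with J = 5, K = 4, so dx = 1/6 and dy = 2/5.
lam_11, res_11 = eigen_residual(5, 4, 1.0 / 6.0, 0.4, 1, 1)
```

Because $2\theta_x + 2\theta_y = 1$, every $\lambda_{p,q}$ lies strictly between 0 and 2, which is consistent with $A$ being positive definite.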

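The norm of $A^{-1}$ is now found to be

$\|A^{-1}\|_2 = [\min_{p,q} (\xi_p + \eta_q)]^{-1} = \dfrac{1}{\xi_1 + \eta_1}$

$= \left[4\theta_x \sin^2\left(\dfrac{\pi\,\delta x}{2a}\right) + 4\theta_y \sin^2\left(\dfrac{\pi\,\delta y}{2b}\right)\right]^{-1}$

$= \dfrac{1}{\delta^2 \pi^2 (a^{-2} + b^{-2})}\,\{1 + O[(\delta x)^2 + (\delta y)^2]\}$.

Thus the error estimate in (20) becomes in this norm,

(28)  $\|\mathbf{U} - \mathbf{u}\|_2 \le \dfrac{1}{\pi^2 (a^{-2} + b^{-2})}\,\|\boldsymbol{\tau}\|_2\,\{1 + O[(\delta x)^2 + (\delta y)^2]\}$.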

This bound is similar to that in (9) but it must be recalled that the norms
are different. We have presented here a convergence proof which is independent
of the maximum principle. There are still other proofs that
could have been given. In particular, if we were to consider the problem
(1) with $f(x, y) \equiv 0$ on $D$, then the solution could easily be written in
terms of Fourier series [assuming $g(x, y)$ to have piecewise continuous
derivatives on $C$]. The solution of the corresponding difference problem
(5), with $f(x, y) \equiv 0$, can also be given in terms of (finite) Fourier series.
A comparison of these explicit solutions would then show convergence
as $\delta x$ and $\delta y$ vanish, and the rate of convergence would depend upon the
smoothness properties of the boundary data, $g(x, y)$. Of course, the
determination of the explicit solutions used in the calculation of $\|A^{-1}\|_2$
cannot be made in most of the applications of the present difference method.
In particular, if the domain is not composed of coordinate lines and/or
if the equation is replaced by one with variable coefficients, then these
special methods must be modified to give analogous results. However, the
maximum principle is readily extended to include many such applications.
Often it may be possible to obtain a bound on $\|A^{-1}\|$ (in some norm)
without having to determine the eigenvalues of $A$.


1.2. An Eigenvalue Problem for the Laplacian Operator 

In view of the development of the previous subsection, we can readily
find approximations to the eigenfunctions, $u(x, y)$, and eigenvalues, $\lambda$, of the
Laplacian operator for a rectangular region. The eigenfunction is not
identically zero, i.e., $u \not\equiv 0$, and for some constant, $\lambda$, (the eigenvalue)
satisfies

(29a)  $-\Delta u = \lambda u$, $(x, y)$ in $D$,

(29b)  $u = 0$, $(x, y)$ in $C$.

We can solve this continuous problem by the separation of variables
technique. Thus we set

$u = f(x)g(y)$,

whence from (29a)

$-\dfrac{f''(x)}{f(x)} - \dfrac{g''(y)}{g(y)} = \lambda$,

while from (29b)

$f(0) = f(a) = g(0) = g(b) = 0$.

But since $\lambda$ is a constant, we find that

$-\dfrac{f''}{f}$ = constant, $0 < x < a$,

$-\dfrac{g''}{g}$ = constant, $0 < y < b$.

The only possible non-trivial solutions of these differential equations and
boundary conditions are proportional to

(30a)  $f_m(x) = \sin\left(\dfrac{m\pi x}{a}\right)$,

(30b)  $g_n(y) = \sin\left(\dfrac{n\pi y}{b}\right)$.

Hence the eigenfunctions and eigenvalues of (29) are

(31a)  $u_{m,n}(x, y) = \sin\left(\dfrac{m\pi x}{a}\right)\sin\left(\dfrac{n\pi y}{b}\right)$,

(31b)  $\lambda_{m,n} = \pi^2\left(\dfrac{m^2}{a^2} + \dfrac{n^2}{b^2}\right)$, $m, n = 1, 2, \ldots$.

[It can be shown that these are all of the independent eigenfunctions and
eigenvalues of (29).] We now exhibit the eigenfunctions $U$ and eigenvalues
$\mu$ of the approximating difference equations defined by $U \not\equiv 0$,

(32a)  $-\Delta_\delta U = \mu U$, $(x, y)$ in $D_\delta$,

(32b)  $U = 0$, $(x, y)$ in $C_\delta$.

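That is, from (27), (26), and (25), with $\mu = \lambda/\delta^2$,

(33a)  $U_{p,q}(x_j, y_k) = \sin\left(j\,\dfrac{p\pi}{J + 1}\right)\sin\left(k\,\dfrac{q\pi}{K + 1}\right)$,

(33b)  $\mu_{p,q} = \dfrac{4}{(\delta x)^2}\sin^2\left(\dfrac{p\pi}{2(J + 1)}\right) + \dfrac{4}{(\delta y)^2}\sin^2\left(\dfrac{q\pi}{2(K + 1)}\right)$,

$1 \le p \le J$, $1 \le q \le K$.

From (31a) and (33a), we note that

(34a)  $u_{p,q}(x_j, y_k) = U_{p,q}(x_j, y_k)$.

This is an exceptional coincidence! On the other hand, if we use (31b)
and expand (33b) for fixed $(p, q)$ and large $(J, K)$, we find

(34b)  $\mu_{p,q} - \lambda_{p,q} = O[p^4(\delta x)^2 + q^4(\delta y)^2]$.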

Equation (34b) expresses the fact, also valid in more general problems, 
that the lowest eigenvalues of the difference operator approximate the 
respective lowest eigenvalues of the differential operator with an error 
proportional to the square of the mesh width. Frequently the error in the 
approximation of the corresponding eigenfunctions is also proportional 
to the square of the mesh width. 
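
The second order accuracy of the discrete eigenvalues can be seen numerically. The sketch below is an illustration under assumptions: it uses the closed form for the discrete eigenvalue that follows from (25b), (26b), and $\mu = \lambda/\delta^2$, namely $\mu_{p,q} = (4/(\delta x)^2)\sin^2(p\pi\,\delta x/2a) + (4/(\delta y)^2)\sin^2(q\pi\,\delta y/2b)$, and compares it with $\lambda_{p,q}$ of (31b) on two nets. The function names `mu` and `lam` and the chosen rectangle are hypothetical.

```python
import math


def mu(p, q, a, b, J, K):
    """Eigenvalue of the difference operator on the J-by-K interior net."""
    dx, dy = a / (J + 1), b / (K + 1)
    return (4.0 / dx**2 * math.sin(p * math.pi * dx / (2.0 * a))**2
            + 4.0 / dy**2 * math.sin(q * math.pi * dy / (2.0 * b))**2)


def lam(p, q, a, b):
    """Eigenvalue (31b) of the continuous problem on the rectangle."""
    return math.pi**2 * (p**2 / a**2 + q**2 / b**2)


a, b, p, q = 1.0, 2.0, 1, 2
err_coarse = abs(mu(p, q, a, b, 9, 9) - lam(p, q, a, b))    # dx = 0.1,  dy = 0.2
err_fine = abs(mu(p, q, a, b, 19, 19) - lam(p, q, a, b))    # dx = 0.05, dy = 0.1
ratio = err_coarse / err_fine   # close to 4 when the mesh widths are halved
```

Since $\sin z < z$ for $z > 0$, the discrete eigenvalue always lies below the continuous one in this setting.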

In most cases, where the eigenvalues of the differential operator obey a
variational principle, the practical problem of determining the eigenvalues
of the difference operator is made simpler by characterizing them as the
stationary values of some functional.
For example, in the case of (29) the eigenvalues are the stationary values,
$\lambda = G[u]$, of

(35)  $G[u] \equiv \dfrac{\iint_D \left[\left(\dfrac{\partial u}{\partial x}\right)^2 + \left(\dfrac{\partial u}{\partial y}\right)^2\right] dx\,dy}{\iint_D u^2\,dx\,dy}$,

where $u(x, y)$ ranges over the class $\mathcal{F}$ of non-trivial functions with continuous
first derivatives and such that $u = 0$ on $C$. We say $G[u]$ is stationary
at $u$, if

$\dfrac{d}{d\epsilon}\,G[u + \epsilon v] = 0$ at $\epsilon = 0$,

for all functions $v$ in $\mathcal{F}$. It can be shown that if $\lambda = G[u]$ is stationary at $u$,
then $u$ has continuous second derivatives and satisfies (29). On the other
hand, the corresponding functional that characterizes the eigenvalues of the
difference operator in (32) is

(36)  $H[U] \equiv \dfrac{Q[U]}{L[U]} \equiv \dfrac{\tfrac{1}{2}\sum\left[(U_x)^2 + (U_{\bar{x}})^2 + (U_y)^2 + (U_{\bar{y}})^2\right]}{\sum U^2}$.

The sums in (36) are taken over all net points of the infinite lattice that
covers the plane and $U$ is in the class of non-trivial net functions which
satisfy

$U(x, y) = 0$ for $(x, y)$ not in $D_\delta$.

THEOREM 4. $\mu = H[U]$ is stationary at $U$ iff $\mu$ and $U$ are an eigenvalue
and eigenfunction that satisfy (32).


Proof. Let

(37)  $V(x_0, y_0) = 1$; $V(x, y) = 0$, $(x, y) \ne (x_0, y_0)$.

It is easy to calculate

$\left.\dfrac{d}{d\epsilon}\,H[U + \epsilon V]\right|_{\epsilon = 0}$

by expanding the numerator and denominator of $H$ to first order in $\epsilon$.
That is,

$Q[U + \epsilon V] \cong Q[U] + \epsilon \sum (U_x V_x + U_{\bar{x}} V_{\bar{x}} + U_y V_y + U_{\bar{y}} V_{\bar{y}}) = Q[U] - 2\epsilon\,\Delta_\delta U(x_0, y_0)$,

$L[U + \epsilon V] \cong L[U] + 2\epsilon \sum UV = L[U] + 2\epsilon\,U(x_0, y_0)$.

Hence,

(38)  $H[U + \epsilon V] \cong H[U] - \dfrac{2\epsilon}{L[U]}\,[\Delta_\delta U(x_0, y_0) + \mu U(x_0, y_0)]$.

Therefore, if $H[U]$ is stationary, (32) holds, since we may pick $(x_0, y_0)$
to be any point in $D_\delta$. On the other hand, in Problem 5 it is shown that
(32) implies $H[U]$ is stationary. ■
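
Theorem 4 can be illustrated numerically: evaluated at a sine eigenfunction of the difference problem, the quotient $H[U]$ should return the corresponding eigenvalue $\mu$. The sketch below assumes the reading of (36) as one half the sum of the four squared difference quotients divided by the sum of $U^2$, with $U$ treated as zero off the interior net; the name `rayleigh_H`, the rectangle, and the indices $p = 2$, $q = 1$ are illustrative assumptions.

```python
import math


def rayleigh_H(U, J, K, dx, dy):
    """Quotient H[U] = Q[U]/L[U]; U[j][k] is used only for interior net
    points 1 <= j <= J, 1 <= k <= K and is taken as zero elsewhere."""
    def val(j, k):
        return U[j][k] if (1 <= j <= J and 1 <= k <= K) else 0.0

    Q = 0.0
    L = 0.0
    # Every non-zero difference quotient sits at a point with
    # 0 <= j <= J+1 and 0 <= k <= K+1, so this range covers the lattice sums.
    for j in range(0, J + 2):
        for k in range(0, K + 2):
            ux = (val(j + 1, k) - val(j, k)) / dx      # forward in x
            uxb = (val(j, k) - val(j - 1, k)) / dx     # backward in x
            uy = (val(j, k + 1) - val(j, k)) / dy      # forward in y
            uyb = (val(j, k) - val(j, k - 1)) / dy     # backward in y
            Q += 0.5 * (ux**2 + uxb**2 + uy**2 + uyb**2)
            L += val(j, k)**2
    return Q / L


J, K, a, b, p, q = 4, 3, 1.0, 1.5, 2, 1
dx, dy = a / (J + 1), b / (K + 1)
U = [[math.sin(j * p * math.pi / (J + 1)) * math.sin(k * q * math.pi / (K + 1))
      for k in range(K + 2)] for j in range(J + 2)]
H_val = rayleigh_H(U, J, K, dx, dy)
mu_val = (4.0 / dx**2 * math.sin(p * math.pi / (2 * (J + 1)))**2
          + 4.0 / dy**2 * math.sin(q * math.pi / (2 * (K + 1)))**2)
```

Perturbing a single interior value of `U` by a small amount changes `H_val` only to second order in the perturbation, which is the stationarity asserted by the theorem.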

We remark that the variational principles can be used as a basis for
constructing methods to determine the eigenfunctions and eigenvalues
as in the Rayleigh-Ritz methods, which we do not treat. Another application
of the related functionals (e.g., $H[U]$ and $G[u]$) is to determine
estimates for $\mu_{p,q} - \lambda_{p,q}$ in more general cases. But we cannot pursue
this topic further.

The eigenvalue problem (32) corresponds to the matrix eigenvalue
problem, similar to (21),

$A\mathbf{U} = \mu\,\delta^2\,\mathbf{U}$,

where $A$ is symmetric. Hence we may use the argument of Theorem 7.2
of Chapter 8 to prove a result analogous to that cited therein.


PROBLEMS, SECTION 1

In Problems 1, 2, and 3 we indicate how to generalize Theorems 1, 2, and 3
for a non-rectangular region. For example, let

$D \equiv \{(x, y) \mid x^2 + y^2 < a^2\}$, $C \equiv \{(x, y) \mid x^2 + y^2 = a^2\}$;

in the notation of Theorem 1, let $P$ be any lattice point and define

$D_\delta \equiv \{P \mid P, P_1, P_2, P_3, P_4 \in D\}$.

Now, if $P \in D$ but $P \notin D_\delta$, we set $P \in C_\delta$ and note that at least one pair of
its opposite neighbors is separated by $C$, say

$P_1 \notin D$, $P_2 \in D$.

Let $P_c \in C$ be on the line segment $PP_1$; let $\theta$ = distance $PP_c$, therefore,
$0 < \theta \le \delta x$. Define $U(P_c) \equiv u(P_c)$ for any point $P_c$ on $C$.

1. Maximum Principle: In the above notation, for $P \in D$ but $P \notin D_\delta$,
define the linear interpolation operator

$B_\delta U(P) \equiv -U(P) + \dfrac{\theta\,U(P_2) + \delta x\,U(P_c)}{\theta + \delta x}$.

Show that

(a) If

$\Delta_\delta U(P) \ge 0$, for $P \in D_\delta$,

and

$B_\delta U(P) \ge 0$, for $P \in C_\delta$,

then

$\max_{P \in D} U(P) \le \max_{P_c \in C} U(P_c)$.

(b) If

$\Delta_\delta U(P) \le 0$, for $P \in D_\delta$,

$B_\delta U(P) \le 0$, for $P \in C_\delta$,

then

$\min_{P \in D} U(P) \ge \min_{P_c \in C} U(P_c)$.

(c) The equations

$-\Delta_\delta U(P) = f(P)$, $P \in D_\delta$,

$B_\delta U(P) = g(P)$, $P \in C_\delta$,

have a unique solution.

2. With the linear interpolation operator (i.e., $B_\delta U(P)$), prove the a priori
estimate, for lattice points in $D$, and any lattice function $U(P)$,

max |l/(P)| < max | C/(/*c)| + %■ K, 

PeDa P c eC } 

where 

K = max [max | 6/(.P)|, 

3. Derive a bound for the error, $E \equiv U - u$, when $U$ is found with rounding
errors $\rho$, $\rho_1$ that satisfy

$-\Delta_\delta U(P) = f(P) + \rho$, $P \in D_\delta$,

$B_\delta U(P) = \rho_1$, $P \in C_\delta$.

If $u$ has continuous derivatives of fourth order and $\delta x = O(h)$, $\delta y = O(h)$,
show that

$\max_{P \in D} |E(P)| = O(h^2)$,


max 

PeCd 


\B 6 U(P)\ . 


for sufficiently small $\rho$, $\rho_1$.

[Hint: Define the truncation error as in (7) for $P \in D_\delta$. Otherwise, set
$\tau\{u(P)\} \equiv B_\delta u(P)$, $P \in C_\delta$.

Apply the a priori estimate to $E(P)$, for $P \in D$.]

4. Show how the statements of the maximum principle, the a priori estimate,
and the error bound must be modified for a more general bounded domain $D$.

5.* With $U$ and $V$ in the class $\mathcal{F}$ of Theorem 4, show that

$\sum (U_x V_x + U_{\bar{x}} V_{\bar{x}} + U_y V_y + U_{\bar{y}} V_{\bar{y}}) = -2\sum V \Delta_\delta U$.

Hence

$H[U + \epsilon V] \cong \mu - \dfrac{2\epsilon}{L[U]}\sum V(\Delta_\delta U + \mu U)$.

Therefore if $U$ satisfies (32), $H[U]$ is stationary.

[Hint: Use summation by parts to remove the difference quotients of $V$.
For example, in $\sum U_x V_x$, the value $V(P)$ for a fixed $P \in D_\delta$ occurs only in
$V_x(P_2)$ and $V_x(P)$. Thus in this sum the coefficient of $V(P)$ is found to be:

$-[U(P_2) - 2U(P) + U(P_1)]/(\delta x)^2$.]


The linear algebraic system of equations determined by the difference
scheme (1.5) is, for the rectangular region, of order $JK$. For small net
spacings $\delta x$ and $\delta y$ this may be extremely large since $JK$ = constant/($\delta x\,\delta y$).
(In practice, $JK > 2500$ is not at all unusual.) Thus the standard
elimination procedures for the equivalent system (1.18) of order $JK$ require
on the order of $(JK)^3$ operations for solution and are too inefficient.
Now from the definition (1.17b) of $A$, we see that many of its elements
are zero and in fact, that it is block tridiagonal. The Gaussian elimination
procedures which take account of large blocks of zero elements (in particular,
the methods of Subsection 3.3 in Chapter 2) are then naturally
suggested. This block elimination method requires at most on the order
of $J^3K$ operations (for rectangular regions) and is efficiently carried out
on modern digital computers. (The storage requirements are for $2K - 1$
matrices of order $J$ and one vector of order $JK$. But this data is used only
in dealing with systems of order $J$ and hence is not all required
at the same time.) In fact, since only tridiagonal systems need to be solved,
efficient organization requires only $O(J^2K)$ operations!

Nevertheless, iterative methods seem to be the ones most often employed 
to solve the Laplace difference equations. Again, the large number of 
zero elements in the coefficient matrix greatly reduces the computational 
effort required in each iteration. However, some care must be taken to 
insure that sufficient accuracy will be obtained in a “reasonable” number of 
iterations. We consider such methods for the rectangular region. 

The simplest iteration method begins with an initial estimate of the
solution, say $U^{(0)}$, and then defines the sequence of net functions $U^{(\nu)}$ by

(1a)   $U^{(\nu+1)}(x, y) = U^{(\nu)}(x, y) + \delta^2 \Delta_\delta U^{(\nu)}(x, y) + \delta^2 f(x, y)$,   $(x, y)$ in $D_\delta$,

(1b)   $U^{(\nu+1)}(x, y) = g(x, y)$,   $(x, y)$ in $C_\delta$,   $\nu = 0, 1, \ldots$.

Here $\delta^2$ is defined in (1.14c) and the boundary condition (1.1b) is to be
satisfied by $U^{(0)}$. From the other definitions in (1.14c) and (1.4), we find
that (1a) is, in subscript notation,

(2)   $U^{(\nu+1)}_{j,k} = \theta_x \left( U^{(\nu)}_{j-1,k} + U^{(\nu)}_{j+1,k} \right) + \theta_y \left( U^{(\nu)}_{j,k-1} + U^{(\nu)}_{j,k+1} \right) + \delta^2 f_{j,k}$,
      $1 \le j \le J$,   $1 \le k \le K$.

[Note the relation between (2) and (1.15).] The calculations required in
(1) or equivalently (2) can be carried out in any order on the net.
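The sweep (2) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the array layout (boundary values stored in the outer rows and columns) and the function name `jacobi_sweep` are hypothetical, and the formulas used for $\theta_x$, $\theta_y$ and $\delta^2$ are the ones we assume (1.14c) to contain, chosen so that $2\theta_x + 2\theta_y = 1$.

```python
import numpy as np

def jacobi_sweep(U, f, dx, dy):
    """One simultaneous (Jacobi) step (2) on an array of shape (J+2, K+2).

    The outer rows/columns of U hold the boundary values g and are left
    unchanged; only interior values of f are used.
    """
    s = 2.0 * (dx**2 + dy**2)
    theta_x, theta_y = dy**2 / s, dx**2 / s   # assumed form of (1.14c)
    delta2 = dx**2 * dy**2 / s                # assumed form of delta^2
    U_new = U.copy()
    # Every new value uses only old values, so the update order is irrelevant.
    U_new[1:-1, 1:-1] = (theta_x * (U[:-2, 1:-1] + U[2:, 1:-1])
                         + theta_y * (U[1:-1, :-2] + U[1:-1, 2:])
                         + delta2 * f[1:-1, 1:-1])
    return U_new
```

Because the whole right-hand side is built from the old iterate, a second array is needed; this is the storage cost that the successive scheme discussed later avoids.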

This iteration scheme is easily written in matrix form by using the 
notation of the previous subsection. We get that 


(3a)   $\mathbf{U}^{(\nu+1)} = (H + V)\mathbf{U}^{(\nu)} + \delta^2 \mathbf{F}$,   $\nu = 0, 1, \ldots$.

Here the $JK$-order matrices are defined by (1.14b) and (1.17b), $\mathbf{F}$ is defined
in (1.16b) and (1.17a), and $\mathbf{U}^{(\nu)}$ is the vector with components ordered
as in (1.17a). From the definition of $A$ in (1.17b), we see that (3a) can be
written as

(3b)   $\mathbf{U}^{(\nu+1)} = (I - A)\mathbf{U}^{(\nu)} + \delta^2 \mathbf{F}$,

and thus, this scheme applied to (1.18) is a special case of the general
iterative methods studied in Section 4 of Chapter 2. In fact, this is just the
Jacobi or simultaneous iteration scheme of Subsection 4.1 in Chapter 2
applied to the system (1.18).

From the general theory of iterative methods, we know that the necessary
and sufficient condition for the convergence of the sequence $\{\mathbf{U}^{(\nu)}\}$
to the solution $\mathbf{U}$ for an arbitrary initial guess $\mathbf{U}^{(0)}$ is that all of the
eigenvalues of $(H + V)$ are in magnitude less than unity (see Theorem 4.1 of
Chapter 2). The eigenvalues of this matrix are the roots of the
characteristic polynomial

(4)   $\Psi(\eta) \equiv \det |\eta I - (H + V)| = \det |\eta I - (I - A)|$.

However, we have determined the eigenvalues of $A$ in the previous
subsection; they are given in (1.27). Thus the eigenvalues of $(I - A)$, and hence
the roots of $\Psi(\eta) = 0$, are

(5a)   $\eta = \eta_{p,q} \equiv 1 - \lambda_{p,q} = 1 - 4\theta_x \sin^2\left(\frac{p\pi}{2(J+1)}\right) - 4\theta_y \sin^2\left(\frac{q\pi}{2(K+1)}\right)$,
       $1 \le p \le J$,   $1 \le q \le K$.

Now we easily find that $-1 < \eta < 1$ and for small $\delta x$ and $\delta y$

(5b)   $\rho(H + V) = \max_{p,q} |\eta_{p,q}| = \eta_{1,1} = 1 - \lambda_{1,1} = 1 - \delta^2 \pi^2 \left(\frac{1}{a^2} + \frac{1}{b^2}\right) + O(\delta^4)$.

Since $0 < \lambda_{1,1} < 1$, the method clearly converges and the rate of
convergence is by (4.11) of Chapter 2

(6)   $R_J = -\log(1 - \lambda_{1,1}) = \delta^2 \pi^2 \left(\frac{1}{a^2} + \frac{1}{b^2}\right) + O(\delta^4)$.

We find that the rate of convergence decreases with

$\delta^2 \equiv \frac{(\delta x)^2 (\delta y)^2}{2[(\delta x)^2 + (\delta y)^2]}$,

and thus for difference equations with a small net spacing we may expect
very slow convergence.
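To get a feeling for how slow this is, the exact rate $-\log \eta_{1,1}$ from (5a) can be compared with the leading term of (6). The sketch below assumes a rectangle with sides $a$ and $b$ and spacings $\delta x = a/(J+1)$, $\delta y = b/(K+1)$; the function name and the sample grid are illustrative choices only.

```python
import numpy as np

def jacobi_radius(J, K, a=1.0, b=1.0):
    """Return (eta_11, delta^2) for a J-by-K interior net on an a-by-b rectangle."""
    dx, dy = a / (J + 1), b / (K + 1)          # assumed net spacings
    s = 2.0 * (dx**2 + dy**2)
    theta_x, theta_y = dy**2 / s, dx**2 / s    # assumed form of (1.14c)
    lam11 = (4.0 * theta_x * np.sin(np.pi / (2 * (J + 1)))**2
             + 4.0 * theta_y * np.sin(np.pi / (2 * (K + 1)))**2)
    return 1.0 - lam11, dx**2 * dy**2 / s

eta11, delta2 = jacobi_radius(J=49, K=49)
R_exact = -np.log(eta11)                 # rate of convergence, as in (6)
R_asym = delta2 * np.pi**2 * 2.0         # leading term of (6) with a = b = 1
sweeps_per_digit = np.log(10.0) / R_exact
```

For this net the two rates agree to well under one percent, and more than a thousand sweeps are needed per decimal digit of accuracy.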

The Gauss-Seidel or successive iteration method for the Laplace
difference equations can be written as

(7a)   $U^{(\nu+1)}_{j,k} = \theta_x \left( U^{(\nu+1)}_{j-1,k} + U^{(\nu)}_{j+1,k} \right) + \theta_y \left( U^{(\nu+1)}_{j,k-1} + U^{(\nu)}_{j,k+1} \right) + \delta^2 f_{j,k}$,
       $1 \le j \le J$,   $1 \le k \le K$.

In matrix form this becomes, with the use of (1.17b),

(7b)   $[I - (\theta_x L + \theta_y B)]\mathbf{U}^{(\nu+1)} = (\theta_x L + \theta_y B)^T \mathbf{U}^{(\nu)} + \delta^2 \mathbf{F}$.

In the present application this iteration scheme is frequently called the
Liebmann method. The new iterates cannot be evaluated in a completely
arbitrary order in this method. We first compute $U^{(\nu+1)}_{1,1}$ and then, in order,
the other elements on the coordinate lines with $j = 1$ and $k = 1$. Next
$U^{(\nu+1)}_{2,2}$ is determined, etc. By slight changes in the scheme we could start the
calculations at any of the other three "corners" in $D_\delta$. However, as we
shall see, all of these methods have the same rate of convergence. This
successive scheme is easier to employ on a digital computer than the
simultaneous scheme since now each new component can immediately
replace the previous value in storage. In addition, we shall find that the
Gauss-Seidel method converges exactly twice as fast as the Jacobi method
(when they are used on the same problem) and thus one should never use
the Jacobi method on such difference problems.
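The remark about storage is easy to see in code: a successive sweep overwrites the array in place. The following is a minimal sketch, assuming the same hypothetical array layout as before (boundary values in the outer rows and columns) and the assumed definitions of $\theta_x$, $\theta_y$, $\delta^2$; the loop order shown is one admissible ordering, with $j$ running fastest as in $r = j + (k-1)J$.

```python
import numpy as np

def gauss_seidel_sweep(U, f, dx, dy):
    """One successive (Gauss-Seidel) step (7a), updating U in place."""
    s = 2.0 * (dx**2 + dy**2)
    theta_x, theta_y = dy**2 / s, dx**2 / s   # assumed form of (1.14c)
    delta2 = dx**2 * dy**2 / s
    J, K = U.shape[0] - 2, U.shape[1] - 2
    for k in range(1, K + 1):
        for j in range(1, J + 1):
            # U[j-1, k] and U[j, k-1] already hold new values at this point.
            U[j, k] = (theta_x * (U[j - 1, k] + U[j + 1, k])
                       + theta_y * (U[j, k - 1] + U[j, k + 1])
                       + delta2 * f[j, k])
    return U
```

No second copy of the net function is kept, in contrast to the simultaneous scheme.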

The convergence of the iteration method (7) is determined by the
magnitude of the eigenvalues of the matrix

(8)   $S_1 \equiv [I - (\theta_x L + \theta_y B)]^{-1} (\theta_x L + \theta_y B)^T$.

The indicated inverse exists since $\theta_x L + \theta_y B$ is a strictly lower triangular
matrix. Thus the eigenvalues, $\zeta$, of the matrix $S_1$ are the roots of the
characteristic polynomial

(9)   $\Phi_1(\zeta) \equiv \det |\zeta [I - (\theta_x L + \theta_y B)] - (\theta_x L + \theta_y B)^T|$
              $= \det |\zeta I - \theta_x (\zeta L + L^T) - \theta_y (\zeta B + B^T)|$.

To examine the roots of this polynomial we shall use the following

THEOREM 1. Let the matrices $L$, $B$, and $A$ be defined as in (1.14b) and
(1.17b). Then for any non-zero scalars $\alpha$ and $\beta$,

(10)   $\det |A| = \det |I - \theta_x (\alpha L + \alpha^{-1} L^T) - \theta_y (\beta B + \beta^{-1} B^T)|$.

Proof. Let the elements of $A$ be $a_{r,s}$ where $r, s = 1, 2, \ldots, N = JK$.
Then each term in the formal expansion of $\det |A|$ is given by a product
of the form

$\pm a_{1,\pi(1)} a_{2,\pi(2)} \cdots a_{r,\pi(r)} \cdots a_{N,\pi(N)}$.

Here $\pi$ is one of the $N!$ permutations of the first $N$ integers. Let each
point $(x_j, y_k)$ of the net $D_\delta$ be identified with a unique integer (see Figure 1)

$(x_j, y_k) \leftrightarrow r = j + (k - 1)J$.

Then any given permutation $\pi$ can be represented by $N$ vectors on $D_\delta$
by drawing lines from $r$ to $\pi(r)$ for $1 \le r \le N$ [i.e., from the point
corresponding to $r$ to the point corresponding to $\pi(r)$]. Now by the definition
of the matrix $A$ it follows that $a_{r,\pi(r)} \ne 0$ only if $r = \pi(r)$ or the point
corresponding to $\pi(r)$ is one of the four neighboring net points, $(x \pm \delta x, y)$
or $(x, y \pm \delta y)$, in the star about the point $(x, y)$ corresponding to $r$.
Thus, the only terms in the expansion of $\det |A|$ which may not vanish
correspond to permutations whose geometric representation is composed
entirely of unit vectors in the $(\pm x)$ and $(\pm y)$ directions and null vectors.

Now every permutation is a product of disjoint cycles and in the above
representation a cycle is a closed path of vectors on $D_\delta$ (see Figure 1).
Thus for any cycle corresponding to a non-vanishing product of elements,
there must be the same number of unit vectors in the $(+x)$ direction as in the
$(-x)$ direction and similarly for the $(\pm y)$ directions. Now we recall that
$a_{r,\pi(r)}$ is an element of $L$ if $\pi(r) = r - 1$ and is an element of $L^T$ if
$\pi(r) = r + 1$. Thus there are as many factors from $L$ as from $L^T$ in any
non-vanishing term in the expansion. Similarly, $a_{r,\pi(r)}$ is an element of $B$ if
$\pi(r) = r - J$ and of $B^T$ if $\pi(r) = r + J$. Thus factors from $B$ and $B^T$
also enter pairwise in any non-vanishing term.

[Figure 1. Geometric representation of a non-vanishing cycle. In the
permutation $r \to \pi(r)$, of which this cycle is a factor, $r = j + (k - 1)J$ and the
cycle is given by: $\pi(r) = r + J$, $\pi(r + J) = r + 2J$, $\ldots$, $\pi(r - 1) = r$.]

These results hold for the expansion of the right-hand determinant in
(10) if elements from $L$, $L^T$, $B$, and $B^T$ are replaced by those of $\alpha L$, $\alpha^{-1} L^T$,
$\beta B$, and $\beta^{-1} B^T$ respectively. Thus in any non-vanishing term the scalars
$\alpha$ and $\beta$ do not appear and the proof is concluded. ■
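The identity (10) is easy to test numerically on a small net. The sketch below builds $L$ and $B$ with Kronecker products, assuming the ordering $r = j + (k-1)J$ used in the proof; the values of $\theta_x$, $\theta_y$, $\alpha$, $\beta$ and the helper name `laplace_blocks` are arbitrary illustrative choices.

```python
import numpy as np

def laplace_blocks(J, K):
    """Strictly lower triangular L and B for the ordering r = j + (k-1)J."""
    L = np.kron(np.eye(K), np.eye(J, k=-1))   # couples (j, k) to (j-1, k)
    B = np.kron(np.eye(K, k=-1), np.eye(J))   # couples (j, k) to (j, k-1)
    return L, B

J, K = 4, 3
theta_x, theta_y = 0.3, 0.2                   # any values with 2*(tx + ty) = 1
L, B = laplace_blocks(J, K)
I = np.eye(J * K)
A = I - theta_x * (L + L.T) - theta_y * (B + B.T)

alpha, beta = 2.5, -0.7                       # arbitrary non-zero scalars
A_scaled = (I - theta_x * (alpha * L + L.T / alpha)
              - theta_y * (beta * B + B.T / beta))
det_A = np.linalg.det(A)
det_scaled = np.linalg.det(A_scaled)          # agrees with det_A up to rounding
```

The agreement of the two determinants for any non-zero $\alpha$, $\beta$ is what allows the scalars to be chosen freely in the arguments that follow.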

We note that the proof of Theorem 1 depends only upon the location
of the zero elements in the matrix $A$. Hence if the non-zero elements of $A$
are changed in any manner we have the

COROLLARY. If $\rho$, $\zeta$, $\eta$, $\xi$, and $\nu$ are matrices with zero components wherever
$I$, $L$, $L^T$, $B$, and $B^T$ respectively have zero components, then

$\det |\rho - (\alpha\zeta + \alpha^{-1}\eta) - (\beta\xi + \beta^{-1}\nu)| = \det |\rho - (\zeta + \eta + \xi + \nu)|$. ■

In particular, we now consider the determinant $\Phi_1(\zeta)$ in (9). The identity
$I$ and the matrices $L$ and $B$ have been multiplied by a scalar $\zeta$, so no zero
elements of $A$ have been altered. Thus we may apply the corollary to get

$\Phi_1(\zeta) = \det |\zeta I - \theta_x (\alpha\zeta L + \alpha^{-1} L^T) - \theta_y (\beta\zeta B + \beta^{-1} B^T)|$.

Take $\alpha = \beta = \zeta^{-1/2}$ and recall (4) to find that

$\Phi_1(\zeta) = \det |\zeta^{1/2} I| \cdot \det |\zeta^{1/2} I - (H + V)| = \zeta^{JK/2}\, \Psi(\zeta^{1/2})$.

Thus every non-zero root $\zeta$ of $\Phi_1(\zeta) = 0$ satisfies $\Psi(\zeta^{1/2}) = 0$ and every
root $\eta$ of $\Psi(\eta) = 0$ satisfies $\Phi_1(\eta^2) = 0$. So all non-zero eigenvalues of the
matrix in (8) are given by

$\zeta = \eta_{p,q}^2 = (1 - \lambda_{p,q})^2$,   $1 \le p \le J$,   $1 \le q \le K$,

and from (5) we find that the maximum eigenvalue of $S_1$ is

$\rho(S_1) = [\rho(H + V)]^2 = (1 - \lambda_{1,1})^2 = 1 - 2\delta^2 \pi^2 \left(\frac{1}{a^2} + \frac{1}{b^2}\right) + O(\delta^4)$.

The rate of convergence for the Gauss-Seidel scheme is thus

(12)   $R_{GS} = 2\delta^2 \pi^2 \left(\frac{1}{a^2} + \frac{1}{b^2}\right) + O(\delta^4)$,

or twice that for the Jacobi scheme.
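The relation $\rho(S_1) = [\rho(H + V)]^2$ can be checked directly on a small net. This is only a numerical illustration: the dense matrices, the net size and the values of $\theta_x$, $\theta_y$ are assumptions made for the sketch, and the ordering $r = j + (k-1)J$ is assumed for the Kronecker products.

```python
import numpy as np

J, K = 5, 4
theta_x, theta_y = 0.3, 0.2
L = np.kron(np.eye(K), np.eye(J, k=-1))
B = np.kron(np.eye(K, k=-1), np.eye(J))
N = theta_x * L + theta_y * B                 # strictly lower triangular part
I = np.eye(J * K)

# Jacobi matrix H + V = N + N^T is symmetric, so eigvalsh applies.
rho_jacobi = np.max(np.abs(np.linalg.eigvalsh(N + N.T)))

# Gauss-Seidel matrix S_1 of (8).
S1 = np.linalg.solve(I - N, N.T)
rho_gs = np.max(np.abs(np.linalg.eigvals(S1)))

# eta_11 from (5a), rewritten using 2*theta_x + 2*theta_y = 1.
eta11 = (2 * theta_x * np.cos(np.pi / (J + 1))
         + 2 * theta_y * np.cos(np.pi / (K + 1)))
```

Here `rho_jacobi` reproduces `eta11` and `rho_gs` reproduces its square, so the rate $-\log \rho$ doubles.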

The convergence rate of the Gauss-Seidel method (7) may be improved
by introducing an appropriate acceleration parameter, as discussed in
Section 5 of Chapter 2. That is, set

(13a)   $V^{(\nu+1)}_{j,k} = \theta_x \left( U^{(\nu+1)}_{j-1,k} + U^{(\nu)}_{j+1,k} \right) + \theta_y \left( U^{(\nu+1)}_{j,k-1} + U^{(\nu)}_{j,k+1} \right) + \delta^2 f_{j,k}$

and then, at the point $(x_j, y_k)$, take

(13b)   $U^{(\nu+1)}_{j,k} = \omega V^{(\nu+1)}_{j,k} + (1 - \omega) U^{(\nu)}_{j,k}$
                    $= U^{(\nu)}_{j,k} + \omega \left( V^{(\nu+1)}_{j,k} - U^{(\nu)}_{j,k} \right)$.

Here $\omega$ is the acceleration parameter to be determined. We note that for
$\omega = 1$ this scheme reduces to that in (7a), i.e., to the ordinary
Gauss-Seidel method. The order in which the components of the new iterates
are to be computed is just as in the previous successive scheme.
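As a sketch of (13a) and (13b), the accelerated sweep differs from the ordinary successive sweep by one line. The array layout (boundary values in the outer rows and columns), the function name `sor_sweep`, and the assumed definitions of $\theta_x$, $\theta_y$, $\delta^2$ are illustrative assumptions.

```python
import numpy as np

def sor_sweep(U, f, dx, dy, omega):
    """One accelerated Gauss-Seidel step, (13a) followed by (13b), in place."""
    s = 2.0 * (dx**2 + dy**2)
    theta_x, theta_y = dy**2 / s, dx**2 / s   # assumed form of (1.14c)
    delta2 = dx**2 * dy**2 / s
    J, K = U.shape[0] - 2, U.shape[1] - 2
    for k in range(1, K + 1):
        for j in range(1, J + 1):
            v = (theta_x * (U[j - 1, k] + U[j + 1, k])
                 + theta_y * (U[j, k - 1] + U[j, k + 1])
                 + delta2 * f[j, k])              # intermediate value (13a)
            U[j, k] += omega * (v - U[j, k])      # relaxation step (13b)
    return U
```

With `omega=1.0` the update reduces to the ordinary Gauss-Seidel sweep.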

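To examine the convergence of the accelerated Gauss-Seidel method we
first write it in matrix form. Obviously (13a) implies

(14a)   $\mathbf{V}^{(\nu+1)} = (\theta_x L + \theta_y B)\mathbf{U}^{(\nu+1)} + (\theta_x L + \theta_y B)^T \mathbf{U}^{(\nu)} + \delta^2 \mathbf{F}$,

and (13b) implies

(14b)   $\mathbf{U}^{(\nu+1)} = \omega \mathbf{V}^{(\nu+1)} + (1 - \omega)\mathbf{U}^{(\nu)}$.

Upon eliminating $\mathbf{V}^{(\nu+1)}$, we obtain

(15)   $[I - \omega(\theta_x L + \theta_y B)]\mathbf{U}^{(\nu+1)} = [(1 - \omega)I + \omega(\theta_x L + \theta_y B)^T]\mathbf{U}^{(\nu)} + \omega\delta^2 \mathbf{F}$.

The convergence of these iterations is thus determined by the magnitude
of the eigenvalues of the matrix

(16)   $S_\omega \equiv [I - \omega(\theta_x L + \theta_y B)]^{-1} [(1 - \omega)I + \omega(\theta_x L + \theta_y B)^T]$.

Note that for $\omega = 1$ the above matrix reduces to the $S_1$ defined in (8)
for the ordinary unaccelerated successive iterations. The eigenvalues of
$S_\omega$ are the roots $\zeta$ of the characteristic polynomial

(17)   $\Phi_\omega(\zeta) \equiv \det |[I - \omega(\theta_x L + \theta_y B)]\zeta - [(1 - \omega)I + \omega(\theta_x L + \theta_y B)^T]|$
               $= \det |(\zeta + \omega - 1)I - \omega\zeta(\theta_x L + \theta_y B) - \omega(\theta_x L^T + \theta_y B^T)|$.

The matrix in (17) has zero elements wherever the matrix $A$ has them
and so the corollary to Theorem 1 is applicable. If we use the scalars
$\alpha = \beta = \zeta^{-1/2}$ then we obtain from (17) and (4)

$\Phi_\omega(\zeta) = \det |(\zeta + \omega - 1)I - \omega\zeta^{1/2}(\theta_x L + \theta_y B) - \omega\zeta^{1/2}(\theta_x L^T + \theta_y B^T)|$
           $= \det |\omega\zeta^{1/2} I| \cdot \det \left| \frac{\zeta + \omega - 1}{\omega\zeta^{1/2}} I - (H + V) \right|$
           $= (\omega\zeta^{1/2})^{JK}\, \Psi\!\left( \frac{\zeta + \omega - 1}{\omega\zeta^{1/2}} \right)$.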

From this result we conclude, for each $\omega \ne 0$, that any non-zero root $\zeta$
of $\Phi_\omega(\zeta) = 0$ satisfies $\Psi[(\zeta + \omega - 1)/(\omega\zeta^{1/2})] = 0$, and that every root $\eta$ of
$\Psi(\eta) = 0$ satisfies $\Phi_\omega(\zeta) = 0$ provided $(\zeta + \omega - 1)/(\omega\zeta^{1/2}) = \eta$. Thus the
non-zero eigenvalues of the matrix $S_\omega$ are the roots $\zeta$ of

(18a)   $\zeta + \omega - 1 = \omega\zeta^{1/2}\eta$

where $\eta$ ranges over the roots of $\Psi(\eta) = 0$ [i.e., the eigenvalues of $I - A$
given in (5a)]. Since (18a) is quadratic in $\zeta^{1/2}$, we find that all $\zeta$ which satisfy
this equation are given by

(18b)   $\zeta = \zeta_\pm \equiv \left[ \frac{\omega\eta}{2} \pm \sqrt{\left(\frac{\omega\eta}{2}\right)^2 + (1 - \omega)} \right]^2$.

We may now determine $\omega$ such that the iteration scheme (15) converges.
First observe that since $\eta$ is real it follows from (18b) that $|\zeta_-| \ge 1$ for
$\omega \le 0$. Thus an eigenvalue of $S_\omega$ will have magnitude at least unity
and we conclude that the accelerated Gauss-Seidel method is not
convergent for any non-positive $\omega$. For fixed $\omega > 0$ we see that some eigenvalues
may be complex (only if $\omega > 1$) but then their magnitude is

(19a)   $|\zeta| = \omega - 1$.

For the real eigenvalues it follows from (18b) with $\omega > 0$ and $\eta > 0$ that
$\zeta_+$ is an increasing function of $\eta$ and that $|\zeta_+| \ge |\zeta_-|$. Thus the largest
real eigenvalue of $S_\omega$ is, since $\eta \le \eta_{1,1}$,

(19b)   $\zeta = \zeta_1(\omega) \equiv \left[ \frac{\omega\eta_{1,1}}{2} + \sqrt{\left(\frac{\omega\eta_{1,1}}{2}\right)^2 + (1 - \omega)} \right]^2$.

From (19) we obtain for $\omega > 0$

$\rho(S_\omega) = \max\,[\omega - 1,\ \zeta_1(\omega)]$.

As $0 < \eta_{1,1} < 1$ it follows that $\rho(S_\omega) < 1$ if $0 < \omega < 2$ since in this
interval, when $\zeta_1$ is real,

$\zeta_1(\omega) < \left[ \frac{\omega}{2} + \sqrt{\left(\frac{\omega}{2}\right)^2 + (1 - \omega)} \right]^2 = 1$.

On the other hand, if $\omega \ge 2$ then $\zeta_1(\omega)$ is complex, and by (19a) some
eigenvalue has modulus not less than unity. Thus we have

THEOREM 2. The accelerated Gauss-Seidel iterations converge iff the
acceleration parameter $\omega$ lies in the interval $0 < \omega < 2$. ■

The optimal value for the acceleration parameter is that value $\omega = \omega^*$,
in $0 < \omega < 2$, for which

$\rho(S_{\omega^*}) = \min_{0 < \omega < 2} \rho(S_\omega) = \min_{0 < \omega < 2} \{\max\,[\omega - 1,\ \zeta_1(\omega)]\}$.

[We know that the indicated minimum exists since $\rho(S_\omega)$ is continuous
in $0 \le \omega \le 2$ and satisfies $\rho(S_0) = \rho(S_2) = 1$, $\rho(S_\omega) < 1$ in $0 < \omega < 2$.]
It is clear that the expression in the radical of (19b) is a decreasing function
of $\omega$ for $0 \le \omega \le 2$. Thus $\zeta_1(\omega)$ becomes complex when this expression
vanishes, i.e., for

$\omega = \omega_b \equiv \frac{2}{1 + \sqrt{1 - \eta_{1,1}^2}}$.

For $\omega \ge \omega_b$ we have now $\rho(S_\omega) = \omega - 1$ and

$\min_{\omega_b \le \omega < 2} \rho(S_\omega) = \omega_b - 1$.

For $0 < \omega \le \omega_b$, since $\zeta_1'(\omega) < 0$, $\zeta_1(\omega)$ is a decreasing function of $\omega$.
Hence $\rho(S_{\omega^*})$ occurs for $\omega^* = \omega_b$ at which $\omega_b - 1 = \zeta_1(\omega_b) = \rho(S_{\omega^*})$.
Thus, in summary, we have for the optimal application of the accelerated
Gauss-Seidel method

(20a)   $\omega^* = \frac{2}{1 + \sqrt{1 - \eta_{1,1}^2}}$,

(20b)   $\rho(S_{\omega^*}) = \omega^* - 1 = \frac{1 - \sqrt{1 - \eta_{1,1}^2}}{1 + \sqrt{1 - \eta_{1,1}^2}}$.

From (5), we have

$\rho(S_{\omega^*}) = 1 - 2\delta\pi \sqrt{2\left(\frac{1}{a^2} + \frac{1}{b^2}\right)} + O(\delta^2)$

and so the rate of convergence is now

(21)   $R_{AGS} = 2\delta\pi \sqrt{2\left(\frac{1}{a^2} + \frac{1}{b^2}\right)} + O(\delta^2)$.

By comparing (21) with (6) and (12), we see that the power of $\delta$ in the rate
of convergence for the optimal accelerated Gauss-Seidel method is lower
than the power of $\delta$ appearing in the ordinary Gauss-Seidel or Jacobi
methods. The same result is obtained if the iterations were to proceed
in one of the other orders indicated after (7b). This is suggested by the
form of our results in which the coordinate directions and related
dimensions enter symmetrically (see, however, the discussion at the end of the
next subsection).
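The formulas (19) and (20) translate directly into a small calculation. The sketch below assumes a square net with equal spacings and 49 interior points per side, for which (5a) gives $\eta_{1,1} = \cos(\pi/50)$; the function names are illustrative.

```python
import numpy as np

def sor_radius(omega, eta11):
    """rho(S_omega) = max[omega - 1, zeta_1(omega)], from (19a) and (19b)."""
    disc = (omega * eta11 / 2.0)**2 + 1.0 - omega
    if disc < 0.0:                       # zeta_1 complex: modulus omega - 1
        return omega - 1.0
    return (omega * eta11 / 2.0 + np.sqrt(disc))**2

def optimal_sor(eta11):
    """omega* and rho(S_omega*) from (20a) and (20b)."""
    root = np.sqrt(1.0 - eta11**2)
    return 2.0 / (1.0 + root), (1.0 - root) / (1.0 + root)

eta11 = np.cos(np.pi / 50)               # assumed: J = K = 49, dx = dy
omega_star, rho_star = optimal_sor(eta11)
R_jacobi = -np.log(eta11)
R_gs = -np.log(sor_radius(1.0, eta11))   # omega = 1 is Gauss-Seidel
R_ags = -np.log(rho_star)
```

On this net `R_gs` is twice `R_jacobi`, while `R_ags` is larger than `R_jacobi` by a factor of more than sixty, which is the gain from the lower power of $\delta$ in (21).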





2.1. Line or Block Iterations 

Since the linear system (1.18), which we are solving iteratively, has the
simple block structure indicated in (1.16a) it is rather natural to consider
corresponding block iterations (i.e., Subsection 4.3 of Chapter 2). In the
present application, these are more properly called "line" methods since
the net function is altered by changing the data on a complete coordinate
line of net points in $D_\delta$ simultaneously. A particularly simple line iteration
for the system in (1.16) is

(22a)   $[I_J - \theta_x (L_J + L_J^T)]\mathbf{U}_1^{(\nu+1)} - \theta_y \mathbf{U}_2^{(\nu)} = \delta^2 \mathbf{F}_1$,

        $-\theta_y \mathbf{U}_{k-1}^{(\nu)} + [I_J - \theta_x (L_J + L_J^T)]\mathbf{U}_k^{(\nu+1)} - \theta_y \mathbf{U}_{k+1}^{(\nu)} = \delta^2 \mathbf{F}_k$,   $2 \le k \le K - 1$,

        $-\theta_y \mathbf{U}_{K-1}^{(\nu)} + [I_J - \theta_x (L_J + L_J^T)]\mathbf{U}_K^{(\nu+1)} = \delta^2 \mathbf{F}_K$.

The $K$ systems for the $\mathbf{U}_k^{(\nu+1)}$ can be solved in any order. At each of the
$K$ steps in one of these iterations, a linear system of order $J$ must be solved
with the coefficient matrix $I_J - \theta_x (L_J + L_J^T)$. However, this matrix is
tridiagonal and can easily be factored by the method of Subsection 3.2
in Chapter 2. This is done only once and then each linear system in the
succeeding iterations is solved by evaluating two simple recursions of the
forms (3.12) and (3.13) of Chapter 2. The present scheme is frequently
called a line Jacobi method.
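A line Jacobi sweep can be sketched as one tridiagonal solve per net line. For brevity the sketch below re-eliminates the tridiagonal matrix on every line instead of storing the factorization once as described above; the array layout (boundary values in the outer rows and columns, moved to the right-hand side), the function names, and the assumed definitions of $\theta_x$, $\theta_y$, $\delta^2$ are illustrative assumptions.

```python
import numpy as np

def solve_tridiag_const(diag, off, rhs):
    """Solve T x = rhs for tridiagonal T with constant diagonal and
    off-diagonal entries (elimination without pivoting)."""
    n = len(rhs)
    c = np.empty(n)                      # modified super-diagonal
    d = np.empty(n)                      # modified right-hand side
    c[0], d[0] = off / diag, rhs[0] / diag
    for i in range(1, n):
        piv = diag - off * c[i - 1]
        c[i] = off / piv
        d[i] = (rhs[i] - off * d[i - 1]) / piv
    x = np.empty(n)
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):       # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def line_jacobi_sweep(U, f, dx, dy):
    """One line Jacobi step (22a); each line k uses old data on lines k-1, k+1."""
    s = 2.0 * (dx**2 + dy**2)
    theta_x, theta_y = dy**2 / s, dx**2 / s   # assumed form of (1.14c)
    delta2 = dx**2 * dy**2 / s
    U_new = U.copy()
    for k in range(1, U.shape[1] - 1):
        rhs = theta_y * (U[1:-1, k - 1] + U[1:-1, k + 1]) + delta2 * f[1:-1, k]
        rhs[0] += theta_x * U[0, k]           # boundary values in x
        rhs[-1] += theta_x * U[-1, k]
        U_new[1:-1, k] = solve_tridiag_const(1.0, -theta_x, rhs)
    return U_new
```

Since the matrix has diagonal 1 and off-diagonals $-\theta_x$ with $2\theta_x < 1$, it is diagonally dominant and the elimination needs no pivoting.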

By using the matrices and vectors in (1.17) we can write the iterative
scheme (22a) as [compare with (3a)]

(22b)   $(I - H)\mathbf{U}^{(\nu+1)} = V\mathbf{U}^{(\nu)} + \delta^2 \mathbf{F}$.

The convergence is thus determined by the matrix

(23a)   $(I - H)^{-1} V$

whose eigenvalues, $\rho$, are the roots of the characteristic polynomial

(23b)   $P(\rho) \equiv \det |\rho I - \rho H - V|$.

It is not difficult to show that the matrices $H$ and $V$ have common
eigenvectors (since they are symmetric and commute). In fact, the eigenvalues
and eigenvectors of these matrices are easily computed. Just as the
eigenvalue problems (1.21) and (1.22) are equivalent, it follows that the
following pairs of eigenvalue problems are also equivalent:

(24a)   $-W_{x\bar{x}} = (\xi/\delta^2) W$ on $D_\delta$, $W = 0$ on $C_\delta$;   $(2\theta_x I - H)\mathbf{W} = \xi \mathbf{W}$,

(24b)   $-W_{y\bar{y}} = (\eta/\delta^2) W$ on $D_\delta$, $W = 0$ on $C_\delta$;   $(2\theta_y I - V)\mathbf{W} = \eta \mathbf{W}$.

[In fact, if we set $\lambda = \xi + \eta$ and add corresponding equations we obtain
(1.21) and (1.22), by using $\theta_x + \theta_y = \frac{1}{2}$.] The problems in (24) may be
solved by separating variables and recalling (1.23)-(1.27). We find that
these problems have common eigenvectors $\mathbf{W}_{p,q}$ with the components

(25)   $W_{p,q}(x_j, y_k) = \sin\left(\frac{jp\pi}{J+1}\right) \sin\left(\frac{kq\pi}{K+1}\right)$,   $1 \le p \le J$,   $1 \le q \le K$,

and the eigenvalues

(26a)   $\xi = \xi_p = 4\theta_x \sin^2\left(\frac{p\pi}{2(J+1)}\right) = 2\theta_x \left[ 1 - \cos\left(\frac{p\pi}{J+1}\right) \right]$,   $1 \le p \le J$;

(26b)   $\eta = \eta_q = 4\theta_y \sin^2\left(\frac{q\pi}{2(K+1)}\right) = 2\theta_y \left[ 1 - \cos\left(\frac{q\pi}{K+1}\right) \right]$,   $1 \le q \le K$.

Each eigenvalue $\xi_p$ of the problem (24a) has multiplicity $K$ and each
eigenvalue $\eta_q$ of (24b) has multiplicity $J$. The eigenvalues of $H$ and $V$ are easily
obtained from the above and are

$2\theta_x \cos\left(\frac{p\pi}{J+1}\right)$   and   $2\theta_y \cos\left(\frac{q\pi}{K+1}\right)$,

respectively.
The vectors $\mathbf{W}_{p,q}$ are also eigenvectors of $(I - H)^{-1} V$ and multiplication
by this matrix yields the eigenvalues, which are also the roots of $P(\rho) = 0$,

(27)   $\rho_{p,q} = \frac{2\theta_y \cos\left(\frac{q\pi}{K+1}\right)}{1 - 2\theta_x \cos\left(\frac{p\pi}{J+1}\right)}$,   $1 \le p \le J$,   $1 \le q \le K$.

The maximum magnitude of the eigenvalues is found by the usual
expansions and some simplification to be

(28a)   $\max_{p,q} |\rho_{p,q}| = \rho_{1,1} = 1 - \frac{\delta y^2 \pi^2}{2} \left(\frac{1}{a^2} + \frac{1}{b^2}\right) + O(\delta^4)$.

Hence the rate of convergence for this line Jacobi scheme is

(28b)   $R_{LJ} = \frac{\delta y^2 \pi^2}{2} \left(\frac{1}{a^2} + \frac{1}{b^2}\right) + O(\delta^4)$.

Note the similarity between this result and that in (6) for the (point)
Jacobi iterations and that in (12) for the successive iterations. If $\delta x = \delta y$,
then $\delta y^2 = 4\delta^2$ and the above rate is essentially that of the Gauss-Seidel
method given in (12).
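The eigenvalue formula (27) can be confirmed on a small net by forming $(I - H)^{-1}V$ explicitly. The dense matrices, net size and parameter values below are illustrative assumptions, with the ordering $r = j + (k-1)J$ assumed for the Kronecker products.

```python
import numpy as np

J, K = 5, 4
theta_x, theta_y = 0.3, 0.2                     # 2*theta_x + 2*theta_y = 1
H = theta_x * np.kron(np.eye(K), np.eye(J, k=1) + np.eye(J, k=-1))
V = theta_y * np.kron(np.eye(K, k=1) + np.eye(K, k=-1), np.eye(J))

M = np.linalg.solve(np.eye(J * K) - H, V)       # line Jacobi matrix (23a)
eigs = np.linalg.eigvals(M).real                # real, since H and V commute
rho_numeric = np.max(np.abs(eigs))

p = np.arange(1, J + 1)[:, None]
q = np.arange(1, K + 1)[None, :]
rho_pq = (2 * theta_y * np.cos(q * np.pi / (K + 1))
          / (1 - 2 * theta_x * np.cos(p * np.pi / (J + 1))))   # formula (27)
```

The computed spectrum coincides with the $JK$ values `rho_pq`, and the largest magnitude is attained at $p = q = 1$.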

Of course, an analog of the method of successive iterations is also possible
for the line methods. We need only use the latest improved data as soon
as it is obtained. Thus in (22) we replace $\mathbf{U}_{k-1}^{(\nu)}$ by $\mathbf{U}_{k-1}^{(\nu+1)}$ for $k = 2, 3, \ldots, K$,
to obtain a line Gauss-Seidel scheme. In matrix form this successive-line
method is written as

(29)   $(I - H - \theta_y B)\mathbf{U}^{(\nu+1)} = \theta_y B^T \mathbf{U}^{(\nu)} + \delta^2 \mathbf{F}$.

However, an accelerated version of these iterations is of interest, and we
directly consider this more general procedure. As before, an intermediate
iterate $\mathbf{V}^{(\nu+1)}$ is defined by

(30a)   $(I - H)\mathbf{V}^{(\nu+1)} = \theta_y B \mathbf{U}^{(\nu+1)} + \theta_y B^T \mathbf{U}^{(\nu)} + \delta^2 \mathbf{F}$.

Then with an arbitrary parameter $\omega$ we set

(30b)   $\mathbf{U}^{(\nu+1)} = \omega \mathbf{V}^{(\nu+1)} + (1 - \omega)\mathbf{U}^{(\nu)}$.

The calculations are performed a line at a time, as in the line Jacobi
method, to determine the $\mathbf{V}_k^{(\nu+1)}$ and then the $\mathbf{U}_k^{(\nu+1)}$ before going on to
$k + 1$. However, now they must be done in a fixed order (say increasing
or decreasing $k$). For $\omega = 1$ this scheme reduces to that in (29).

The accelerated successive-line method becomes upon the elimination
of $\mathbf{V}^{(\nu+1)}$ in (30)

$(I - H - \omega\theta_y B)\mathbf{U}^{(\nu+1)} = [(1 - \omega)I - (1 - \omega)H + \omega\theta_y B^T]\mathbf{U}^{(\nu)} + \omega\delta^2 \mathbf{F}$.

To examine the convergence of the scheme, we must determine the
eigenvalues of the matrix

(31a)   $T_\omega \equiv (I - H - \omega\theta_y B)^{-1} [(1 - \omega)I - (1 - \omega)H + \omega\theta_y B^T]$,

which are the roots $\tau$ of the characteristic polynomial

(31b)   $Q_\omega(\tau) \equiv \det |(\tau + \omega - 1)(I - H) - \tau\omega\theta_y B - \omega\theta_y B^T|$.

We see that the matrix in (31b) has zero elements wherever the matrix
$A$ has them and so just as in (17), we can apply the corollary to Theorem 1.
With the scalars $\alpha = 1$ and $\beta = \tau^{-1/2}$ we then get from (31b) and (23b)

$Q_\omega(\tau) = \det |(\tau + \omega - 1)(I - H) - \omega\tau^{1/2} V|$
           $= \det |\omega\tau^{1/2} I| \cdot \det \left| \frac{\tau + \omega - 1}{\omega\tau^{1/2}} (I - H) - V \right|$
           $= (\omega\tau^{1/2})^{JK}\, P\!\left( \frac{\tau + \omega - 1}{\omega\tau^{1/2}} \right)$.

It follows that $\tau$ and the roots $\rho$ of $P(\rho) = 0$ are related by

(32)   $\tau + \omega - 1 = \omega\tau^{1/2}\rho$

by the reasoning that led to (18a). For $\omega = 1$, the iterations reduce to the
ordinary successive-line iterations and the non-zero roots $\tau$ are given by
$\tau = \rho^2$. Thus this method converges twice as fast as the line Jacobi
method. Finally, since the eigenvalues $\rho$ lie in $0 < \rho < 1$, the arguments
used in (18)-(21) can be applied to the roots $\tau(\omega)$ and the acceleration
parameter $\omega$, which satisfy (32). Now the optimal parameter value $\omega^*$
and minimum value $\rho(T_{\omega^*})$ of $\rho(T_\omega)$ become, where $\rho(T)$ denotes the
spectral radius of $T$,

(33a)   $\omega^* = \frac{2}{1 + \sqrt{1 - \rho_{1,1}^2}}$,

(33b)   $\rho(T_{\omega^*}) = \omega^* - 1 = \frac{1 - \sqrt{1 - \rho_{1,1}^2}}{1 + \sqrt{1 - \rho_{1,1}^2}}$.

By using (28a), we find

$\rho(T_{\omega^*}) = 1 - 2\delta y\,\pi \sqrt{\frac{1}{a^2} + \frac{1}{b^2}} + O(\delta^2)$,

and hence the rate of convergence of the optimum accelerated line
Gauss-Seidel method is

(34)   $R_{ALGS} = 2\delta y\,\pi \sqrt{\frac{1}{a^2} + \frac{1}{b^2}} + O(\delta^2)$.

To compare rates of convergence we note, using (21), that

(35)   $\frac{R_{ALGS}}{R_{AGS}} = \frac{\delta y}{\sqrt{2}\,\delta} + O(\delta^2) = \sqrt{1 + \left(\frac{\delta y}{\delta x}\right)^2} + O(\delta^2)$.

Thus it follows that for any mesh ratio, $\delta y/\delta x$, the optimum accelerated
successive-line method has a larger rate of convergence than the
corresponding optimum accelerated successive (point) iterations. For equal
net spacing in the $x$- and $y$-directions the factor of improvement is,
asymptotically, $\sqrt{2}$. However, if $\delta y > \delta x$, even greater improvement
results. We observe here that the net lines along which the new data are
obtained at each step should be in the direction of the smallest mesh
width; i.e., the "closest" neighbors are grouped together on a line and
improved as a group. All of the above could be repeated with $H$ and $V$
interchanged which corresponds to taking lines in the $y$-direction. The only
change in (35) that would result is the interchange of $\delta x$ and $\delta y$. A decision
as to whether the ALGS scheme is more efficient than the AGS scheme
must depend upon the size of

$\mu \equiv \frac{\#\,\text{ops ALGS}}{\#\,\text{ops AGS}}$;

$\mu$ measures the ratio of the amounts of work involved in one iteration
step for each of the two methods. If

$\mu < \sqrt{1 + \left(\frac{\delta y}{\delta x}\right)^2}$,

then the ALGS scheme is more efficient; otherwise, the AGS scheme is
more efficient.

2.2. Alternating Direction Iterations 

One of the most effective iteration schemes for solving the system (1.16)
or (1.18) employs a combination of horizontal and vertical line iterations.
In terms of an acceleration parameter $\omega$, and recalling that $2\theta_x + 2\theta_y = 1$,
such a scheme due to Peaceman and Rachford can be defined as follows:

(36a)   $[(\omega + 2\theta_x)I - H]\mathbf{U}^{\nu+1/2} = [(\omega - 2\theta_y)I + V]\mathbf{U}^{\nu} + \delta^2 \mathbf{F}$,

(36b)   $[(\omega + 2\theta_y)I - V]\mathbf{U}^{\nu+1} = [(\omega - 2\theta_x)I + H]\mathbf{U}^{\nu+1/2} + \delta^2 \mathbf{F}$.

The vector $\mathbf{U}^{\nu+1/2}$ is an intermediate quantity used to define the scheme
and of course it is actually computed in carrying out the procedure. The
first step, (36a), is just a horizontal line scheme, similar to line-Jacobi.
(In fact, with $\omega = 2\theta_y$ in (36a), we obtain (22b) with $\mathbf{U}^{\nu+1}$ replaced by
$\mathbf{U}^{\nu+1/2}$.) Clearly then (36b) is essentially a vertical line-Jacobi iteration.
The vector to be found in each of the stages (36a and b) is easily evaluated
by solving a tridiagonal system.

To study the convergence of this scheme, we eliminate $\mathbf{U}^{\nu+1/2}$ in (36)
and obtain, assuming for the moment that the required inverses exist,

$\mathbf{U}^{\nu+1} = Q_\omega \mathbf{U}^{\nu} + \mathbf{f}_\omega$,

where

(37)   $Q_\omega \equiv [(\omega + 2\theta_y)I - V]^{-1} [(\omega - 2\theta_x)I + H] \times [(\omega + 2\theta_x)I - H]^{-1} [(\omega - 2\theta_y)I + V]$

and

$\mathbf{f}_\omega \equiv [(\omega + 2\theta_y)I - V]^{-1} \times \{[(\omega - 2\theta_x)I + H][(\omega + 2\theta_x)I - H]^{-1} + I\}\,\delta^2 \mathbf{F}$.

The eigenvalues of $Q_\omega$ are easily obtained since the matrices $(2\theta_x I - H)$
and $(2\theta_y I - V)$ have common eigenvectors given in (25). We obtain
using (24) and the eigenvalues given in (26)

$Q_\omega \mathbf{W}_{p,q} = \frac{(\omega - \xi_p)(\omega - \eta_q)}{(\omega + \xi_p)(\omega + \eta_q)}\, \mathbf{W}_{p,q}$.

Thus the eigenvalues, say $\lambda(\omega)$, of $Q_\omega$ are

(38)   $\lambda_{p,q}(\omega) = \frac{(\omega - \xi_p)(\omega - \eta_q)}{(\omega + \xi_p)(\omega + \eta_q)}$,   $p = 1, 2, \ldots, J$,   $q = 1, 2, \ldots, K$.

Since $\xi_p > 0$ and $\eta_q > 0$ for all $p$ and $q$, it follows that the alternating
direction scheme (36) converges for any choice of $\omega > 0$. We also note
that all relevant inverses exist for positive $\omega$.
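As a numerical illustration of (37) and (38), the iteration matrix $Q_\omega$ can be formed with dense matrices on a small net and its spectral radius compared with the closed form. This is a sketch for checking the formula only; a practical code would solve tridiagonal systems line by line. The net size, parameter values and the function name `adi_matrix` are assumptions.

```python
import numpy as np

J, K = 5, 4
theta_x, theta_y = 0.3, 0.2                     # 2*theta_x + 2*theta_y = 1
H = theta_x * np.kron(np.eye(K), np.eye(J, k=1) + np.eye(J, k=-1))
V = theta_y * np.kron(np.eye(K, k=1) + np.eye(K, k=-1), np.eye(J))
I = np.eye(J * K)

def adi_matrix(omega):
    """Q_omega of (37): stage (36a) followed by stage (36b), with F = 0."""
    half = np.linalg.solve((omega + 2 * theta_x) * I - H,
                           (omega - 2 * theta_y) * I + V)
    return np.linalg.solve((omega + 2 * theta_y) * I - V,
                           ((omega - 2 * theta_x) * I + H) @ half)

omega = 0.4
rho_numeric = np.max(np.abs(np.linalg.eigvals(adi_matrix(omega))))

xi = 2 * theta_x * (1 - np.cos(np.arange(1, J + 1) * np.pi / (J + 1)))    # (26a)
eta = 2 * theta_y * (1 - np.cos(np.arange(1, K + 1) * np.pi / (K + 1)))   # (26b)
lam = np.outer((omega - xi) / (omega + xi), (omega - eta) / (omega + eta))  # (38)
rho_formula = np.max(np.abs(lam))
```

Both values agree and are below one, as the argument above predicts for any positive $\omega$.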

The trick in the proper use of the alternating direction type schemes is
not to use a single acceleration parameter $\omega$ as above but rather to use a
sequence of them, say $\omega_1, \omega_2, \ldots, \omega_m$, applied periodically (or cyclically).
That is, the calculations in (36) are to be carried out $m$ times (using each
$\omega_i$ for a complete double sweep of the net) in order to compute $\mathbf{U}^{\nu+1}$ from
$\mathbf{U}^{\nu}$. To actually write this scheme out we should introduce $2m - 1$
intermediate quantities $\mathbf{U}^{\nu+1/(2m)}, \mathbf{U}^{\nu+2/(2m)}, \ldots, \mathbf{U}^{\nu+1-1/(2m)}$ and successively
use (36a and b) for the pairs $\mathbf{U}^{\nu+(2i-1)/(2m)}$, $\mathbf{U}^{\nu+2i/(2m)}$. As before, we find
that the eigenvalues which determine convergence are now

(39)   $\lambda_{p,q}(\omega_1, \omega_2, \ldots, \omega_m) = \prod_{i=1}^{m} \frac{(\omega_i - \xi_p)(\omega_i - \eta_q)}{(\omega_i + \xi_p)(\omega_i + \eta_q)}$,   $p = 1, 2, \ldots, J$,   $q = 1, 2, \ldots, K$.

If we take $m = J$ and choose $\omega_j = \xi_j$ for $j = 1, 2, \ldots, J$, then it clearly
follows from (39) that

$\lambda_{p,q}(\omega_1, \omega_2, \ldots, \omega_J) = 0$

for all $p$ and $q$. In this case the exact solution is obtained in a finite number
of steps. Of course we could also employ $\omega_i = \eta_i$ with $m = K$ to get
similar results. However, both $J$ and $K$ are extremely large in general
and we desire to obtain an accurate approximation in only $\nu$ iterations
where $m\nu \ll J$ and $m\nu \ll K$. Thus we consider the problem, with fixed
small $m$, to find $\omega_1, \omega_2, \ldots, \omega_m$ such that

$\max_{p,q} |\lambda_{p,q}(\omega_1, \omega_2, \ldots, \omega_m)|$

is minimized with respect to all possible choices of the acceleration
parameters $\omega_i$.

This problem is related to the subject of best approximations, Section 4
of Chapter 5. Specifically let us define the function

(40)   $F(z) \equiv \prod_{i=1}^{m} \frac{(\omega_i - z)}{(\omega_i + z)}$.

Then from (39) we have, recalling (26),

$\max_{p,q} |\lambda_{p,q}(\omega_1, \omega_2, \ldots, \omega_m)| \le \max_{\xi_1 \le x \le \xi_J,\ \eta_1 \le y \le \eta_K} |F(x)F(y)|$.

Thus we seek $\omega_1, \ldots, \omega_m$ such that the rational function $F(x)F(y)$ is the best
(uniform) approximation to zero on the rectangle $\xi_1 \le x \le \xi_J$, $\eta_1 \le y \le \eta_K$.
The optimization problem is further simplified by noting that for
all $x$, $y$ on this rectangle

$|F(x)F(y)| \le \max_{\alpha \le z \le \beta} F^2(z) \equiv \|F\|_\infty^2$

where $\alpha = \min(\xi_1, \eta_1)$ and $\beta = \max(\xi_J, \eta_K)$. Thus our problem is
reduced to finding the best approximation to zero of the form (40) on an
interval $0 < \alpha \le z \le \beta$. The existence and uniqueness of such a best
rational approximation can be proved in a manner analogous to the
treatment in Section 4 of Chapter 5 of best polynomial approximations.
We shall not present the analysis here of how to determine the
optimum parameters $\omega_j^*$. Rather, we show how to find a set of parameters
$\omega_j$, for which we can estimate $\|F\|_\infty$ in order to compare the rate of
convergence of the cyclic alternating direction method with the previously
studied iterative methods.

In Problem 1, we verify that for $m = 1$ the choice $\omega_1 = \sqrt{\alpha\beta}$ minimizes
$\|F\|_\infty$, and

$\|F\|_\infty = \frac{1 - \sqrt{\alpha/\beta}}{1 + \sqrt{\alpha/\beta}}$.

Hence we divide the interval $[\alpha, \beta]$ by points $0 < \alpha_0 = \alpha < \alpha_1 < \cdots < \alpha_m = \beta$,
such that

$\frac{\alpha_0}{\alpha_1} = \frac{\alpha_1}{\alpha_2} = \cdots = \frac{\alpha_{m-1}}{\alpha_m}$.

The values $\alpha_j$ which have this property are

(41a)   $\alpha_j = \alpha \left(\frac{\beta}{\alpha}\right)^{j/m}$,   $j = 0, 1, \ldots, m$.

We now set

(41b)   $\omega_j = \sqrt{\alpha_{j-1}\alpha_j}$,   $j = 1, 2, \ldots, m$,
and find that, since the magnitude of each factor of $F(z)$ is bounded by
unity,

$\max_{\alpha \le z \le \beta} |F(z)| \le \max_{i = 1, 2, \ldots, m}\ \max_{\alpha_{i-1} \le z \le \alpha_i} \left| \frac{\omega_i - z}{\omega_i + z} \right|$.

On the other hand, with the choice (41), the result in Problem 1 implies
that

$\max_{\alpha_{i-1} \le z \le \alpha_i} \left| \frac{\omega_i - z}{\omega_i + z} \right| = \frac{1 - (\alpha/\beta)^{1/2m}}{1 + (\alpha/\beta)^{1/2m}}$

for all $i$. Hence

$\|F\|_\infty \le \frac{1 - (\alpha/\beta)^{1/2m}}{1 + (\alpha/\beta)^{1/2m}}$.

From the definitions of $\alpha$ and $\beta$, and the results (1.25) and (1.26), we note
that $\alpha = O(\delta^2)$, $\beta = O(1)$. Therefore,

$\|F\|_\infty \le 1 - O(\delta^{1/m})$.

Hence we have shown that the rate of convergence of the cyclic alternating
direction method is at least $O(\delta^{1/m})$, for a complete cycle. But the amount
of work required to compute $m$ sweeps as in (36a and b) is equivalent to
the work required for about $2m$ applications of the line accelerated schemes.
Now, the convergence rate of $2m$ applications of a line accelerated scheme
is $O(2m\delta)$. This is much smaller than $O(\delta^{1/m})$, the convergence rate for
one cycle of the alternating direction method for small $m$. We have thus
shown that the alternating direction method is more efficient than any of
the other iterative schemes, even when parameters that are not necessarily
optimal are employed. For detailed comparisons we refer to the book
of Varga. In practice it is wise to start each cycle with the largest parameter
value, $\omega_m$, and then successively to use the smaller values.
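The parameter choice can be sketched numerically. The sketch assumes that (41a) is the geometric subdivision $\alpha_j = \alpha(\beta/\alpha)^{j/m}$ and that (41b) takes $\omega_j = \sqrt{\alpha_{j-1}\alpha_j}$, which is the choice consistent with Problem 1 on each subinterval; the function names and the sample values of $\alpha$, $\beta$, $m$ are illustrative.

```python
import numpy as np

def adi_parameters(alpha, beta, m):
    """Cyclic parameters (assumed form of (41)) and the bound on ||F||."""
    j = np.arange(m + 1)
    a = alpha * (beta / alpha) ** (j / m)        # geometric points alpha_j
    omega = np.sqrt(a[:-1] * a[1:])              # one parameter per subinterval
    r = (alpha / beta) ** (1.0 / (2 * m))
    return omega, (1.0 - r) / (1.0 + r)

def F(z, omega):
    """The rational function (40) evaluated at the points z."""
    z = np.asarray(z, dtype=float)[..., None]
    return np.prod((omega - z) / (omega + z), axis=-1)

alpha, beta, m = 1e-3, 1.0, 4
omega, bound = adi_parameters(alpha, beta, m)
z = np.geomspace(alpha, beta, 2001)
max_F = np.max(np.abs(F(z, omega)))              # stays below the bound
single = (1 - np.sqrt(alpha / beta)) / (1 + np.sqrt(alpha / beta))  # m = 1
```

For these sample values the bound with $m = 4$ is roughly 0.41, against roughly 0.94 for the single parameter $\sqrt{\alpha\beta}$, which illustrates why a short cycle of parameters pays off.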


PROBLEM, SECTION 2 


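1. Given $0 < \alpha < \beta$, show that

$\min_{\omega \ge 0} \left\{ \max_{\alpha \le z \le \beta} \left| \frac{\omega - z}{\omega + z} \right| \right\} = \frac{1 - \sqrt{\alpha/\beta}}{1 + \sqrt{\alpha/\beta}}$

and that the minimum value is attained for $\omega = \omega^* \equiv \sqrt{\alpha\beta}$.

[Hint: The function $(\omega - z)/(\omega + z)$ is a monotonic function of $z$ for any
fixed $\omega$. Hence it attains its extreme values at $z = \alpha$ and $z = \beta$. Equal
extreme values are attained for $\omega = \omega^*$.]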

3. WAVE EQUATION AND AN EQUIVALENT SYSTEM

We consider the initial value or Cauchy problem for the wave
equation: Find a function $u(x, t)$ continuous in the half plane

$D \equiv \{x, t \mid t \ge 0,\ -\infty < x < \infty\}$

which satisfies, for $t > 0$,

(1)   $\frac{\partial^2 u}{\partial t^2} - c^2 \frac{\partial^2 u}{\partial x^2} = 0$;

and for $t = 0$,

(2a)   $u(x, 0) = f(x)$,

(2b)   $\frac{\partial u(x, 0)}{\partial t} = g(x)$.

This problem may be solved explicitly in terms of quadratures. That is,
by using the change of variables

$\xi = x + ct$,   $\eta = x - ct$,   $\phi(\xi, \eta) = u(x, t)$,

we find

$\frac{\partial}{\partial x} = \frac{\partial}{\partial \xi} + \frac{\partial}{\partial \eta}$   and   $\frac{\partial}{\partial t} = c\left(\frac{\partial}{\partial \xi} - \frac{\partial}{\partial \eta}\right)$,

whence equation (1) reduces to

$-4c^2 \frac{\partial^2 \phi}{\partial \xi\, \partial \eta} = 0$.

The general solution of this equation is found, by two integrations, to be
of the form

$\phi(\xi, \eta) = P(\xi) + Q(\eta)$.

Thus the general solution of (1) is

(3)   $u(x, t) = P(x + ct) + Q(x - ct)$,

where $P$ and $Q$ are arbitrary (twice differentiable) functions. Since
$P(x + ct)$ is constant along lines $x + ct$ = constant, this part of the
solution can be considered as a signal or wave which propagates to the
left with speed $c > 0$ as time increases. Similarly, $Q(x - ct)$ represents
a wave moving to the right with speed $c$. The lines in the $x,t$-plane along
which the signals travel,

$x \pm ct$ = constant,

are called the characteristics of equation (1).

The initial conditions (2), when applied to (3), yield

$P(x) + Q(x) = f(x)$,

$P'(x) - Q'(x) = \frac{1}{c}\, g(x)$.

Thus by integrating the second relation over $[0, x]$ and using the first,
we may solve the pair of equations for $P(x)$ and $Q(x)$. That is, set
$K = \frac{1}{2}[P(0) - Q(0)]$ and find

$P(x) = \frac{1}{2} f(x) + \frac{1}{2c} \int_0^x g(\xi)\, d\xi + K$,

$Q(x) = \frac{1}{2} f(x) - \frac{1}{2c} \int_0^x g(\xi)\, d\xi - K$.

If we replace $x$ in $P(x)$ by $x + ct$ and in $Q(x)$ by $x - ct$, we get from (3)
the solution of the initial value problem

(4)   $u(x, t) = \frac{1}{2}[f(x + ct) + f(x - ct)] + \frac{1}{2c} \int_{x - ct}^{x + ct} g(\xi)\, d\xi$.

Clearly, the solution at any point $(x^*, t^*)$ depends upon the initial data
only in the interval $[x^* - ct^*, x^* + ct^*]$ on the initial line, $t = 0$. This
interval is cut out by the two characteristics passing through $(x^*, t^*)$
shown in Figure 1. The shaded triangle in this figure is called the domain
of dependence of the point $(x^*, t^*)$ and its base is the interval of dependence.
The Cauchy problem, (1) and (2), can also be formulated as an initial 
value problem for a first order system of partial differential equations. 
In particular, introduce the function t;(x, t) and consider 

du __ dv 
dt C dx 

(5) 

dv _ du 

dt C dx 
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subject to the initial values 

(6a) u(x, 0) = /(x), 

(6b) v(x, 0) = <7(x). 

Equation (1) results after the elimination of $v(x, t)$ from equations (5).
With the same change of variables used before, we find that the general
solution of (5) has the form

    $u(x, t) = p(x + ct) + q(x - ct),$

    $v(x, t) = p(x + ct) - q(x - ct),$

where $p$ and $q$ are again arbitrary functions but only required to possess
one derivative. To satisfy the initial conditions (6), we must have

    $p(x) = \tfrac{1}{2} f(x) + \tfrac{1}{2} G(x),$

    $q(x) = \tfrac{1}{2} f(x) - \tfrac{1}{2} G(x),$

and hence the solution of the initial value problem (5) and (6) is

     $u(x, t) = \tfrac{1}{2}[f(x + ct) + f(x - ct)] + \tfrac{1}{2}[G(x + ct) - G(x - ct)],$
(7)
     $v(x, t) = \tfrac{1}{2}[f(x + ct) - f(x - ct)] + \tfrac{1}{2}[G(x + ct) + G(x - ct)].$

A comparison of the solutions $u$ given in (7) and (4) shows that the two
Cauchy problems are equivalent if in (6b) we take

    $G(x) = \frac{1}{c} \int_0^x g(\xi)\, d\xi + \text{constant}.$

This relation could have been derived directly by satisfying the first
equation of (5) at $t = 0$ and using (2b).

For the system (5), the lines $x \pm ct = \text{constant}$ are again the
characteristics, and the domain of dependence is still as in Figure 1. Of course, as
is clear from (7), the solution at any point $(x^*, t^*)$ is now determined by
the values of the initial data (6) at the points where the characteristics
through $(x^*, t^*)$ intersect the initial line, $t = 0$. These properties of the
system (5) become particularly transparent if we first add and then subtract
the equations in this system to get

    $\left( \frac{\partial}{\partial t} - c\, \frac{\partial}{\partial x} \right)(u + v) = 0,$

    $\left( \frac{\partial}{\partial t} + c\, \frac{\partial}{\partial x} \right)(u - v) = 0.$

This is called the characteristic form of the system (5) and the combinations
$u \pm v$ are the characteristic (dependent) variables. In the $(x, t)$-plane
the operators $\frac{\partial}{\partial t} \mp c\, \frac{\partial}{\partial x}$ represent differentiation in two specific
directions, the characteristic directions, and each characteristic variable is
differentiated in an appropriate one of these directions. The notions of
characteristic direction, characteristic variable and domain of dependence
are useful in the treatment of more general equations of hyperbolic† type
in two independent variables $(x, t)$. For example, consider the simplest
case of a linear system of $n$ first order differential equations for the $n$
functions $\{u_j\}$ that are the components of $\mathbf{u}$,

    $\frac{\partial \mathbf{u}}{\partial t} + A\, \frac{\partial \mathbf{u}}{\partial x} = \mathbf{b}.$

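Suppose that the square matrix $A = A(x, t)$ has $n$ real eigenvalues $\{a_i\}$
and a complete set of eigenvectors. Let $P$ be the matrix whose columns
are the eigenvectors of $A$. Then define $\mathbf{v}$ by $\mathbf{u} = P\mathbf{v}$ and insert in the above
system. This yields

    $\frac{\partial}{\partial t}(P\mathbf{v}) + A\, \frac{\partial}{\partial x}(P\mathbf{v}) = \mathbf{b},$

or, by differentiation

    $P\, \frac{\partial \mathbf{v}}{\partial t} + AP\, \frac{\partial \mathbf{v}}{\partial x} = \mathbf{b} - \frac{\partial P}{\partial t}\, \mathbf{v} - A\, \frac{\partial P}{\partial x}\, \mathbf{v}.$

If we multiply both sides on the left by $P^{-1}$, we find

    $\frac{\partial \mathbf{v}}{\partial t} + P^{-1}AP\, \frac{\partial \mathbf{v}}{\partial x} = P^{-1}\left[ \mathbf{b} - \frac{\partial P}{\partial t}\, \mathbf{v} - A\, \frac{\partial P}{\partial x}\, \mathbf{v} \right].$

This system is in the simple characteristic form. That is, differentiation
in only a single (characteristic) direction,

    $\frac{\partial}{\partial t} + a_i\, \frac{\partial}{\partial x},$

occurs in each equation, since $P^{-1}AP$ is a diagonal matrix with the $\{a_i\}$
on the diagonal. The components of $\mathbf{v} = P^{-1}\mathbf{u}$ are the characteristic
variables.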

We refrain from giving the definition of characteristic surface, which
plays a vital role in the theory of partial differential equations in more
dimensions. It is sufficient to say that the notion of domain of dependence
is important for hyperbolic equations in higher dimensions but the notion
of characteristic form of the system does not generalize.

† A system of partial differential equations is said to be of hyperbolic type if the
Cauchy initial value problem is well posed for this system. For a linear system of
equations, simple algebraic properties of the coefficients have been shown to imply the
hyperbolicity of the system, e.g., the conditions on the matrix $A$ above.

The homogeneous characteristic equation has the form

(8a)  $\frac{\partial w}{\partial t} + a\, \frac{\partial w}{\partial x} = 0.$

Now on any curve $x = x(t)$ in the $(x, t)$-plane, $w$ is a function of $t$ given
by $w(x(t), t)$ and has the total derivative

    $\frac{dw}{dt} = \frac{\partial w}{\partial t} + \frac{\partial w}{\partial x}\, \frac{dx}{dt}.$

Thus if the curve is chosen such that

(8b)  $\frac{dx}{dt} = a,$

then any solution $w$ of (8) satisfies $dw/dt = 0$ and hence, is constant on
such a curve. The curves (8b) are the characteristics and if $a$ is not a
constant, they are not straight lines.
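
To make the statement about curved characteristics concrete, here is a small sketch under stated assumptions; it is not from the original text. We assume the variable coefficient $a(x, t) = x$, for which $w(x, t) = f(x e^{-t})$ solves (8a), trace one characteristic of (8b) by the classical Runge-Kutta method, and record how much $w$ varies along it. The function name `trace_characteristic` is hypothetical.

```python
import math

def trace_characteristic(a, x0, t_end, steps=1000):
    """Integrate dx/dt = a(x, t) from (x0, 0) to t_end with classical RK4."""
    x, t = x0, 0.0
    h = t_end / steps
    path = [(t, x)]
    for _ in range(steps):
        k1 = a(x, t)
        k2 = a(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = a(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = a(x + h * k3, t + h)
        x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        t += h
        path.append((t, x))
    return path

# Assumed example: a(x, t) = x, so that w(x, t) = f(x * exp(-t)) satisfies (8a).
f = math.sin
w = lambda x, t: f(x * math.exp(-t))
path = trace_characteristic(lambda x, t: x, x0=0.7, t_end=1.0)
values = [w(x, t) for t, x in path]
spread = max(values) - min(values)   # variation of w along the characteristic
```

The traced curve is $x = 0.7\, e^{t}$, which is not a straight line, and `spread` stays at rounding level.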

The Cauchy problems previously formulated and solved could have been 
solved by the method of separation of variables. Instead, we shall now apply 
this method to a special mixed initial-boundary value problem for the wave 
equation. The problem of interest is to solve the wave equation 


    $\frac{\partial^2 u}{\partial t^2} - c^2\, \frac{\partial^2 u}{\partial x^2} = 0,$

subject to the initial conditions

(9a)  $u(x, 0) = f(x),$

(9b)  $\frac{\partial u}{\partial t}(x, 0) = 0, \qquad 0 < x < L,$

and the boundary conditions

(10a)  $u(0, t) = 0,$

(10b)  $u(L, t) = 0, \qquad t > 0.$

The solution is to be determined in the strip $R \equiv \{x, t \mid 0 < x < L,\ t > 0\}$.
For convenience we have used a homogeneous initial condition in (9b).
Let us seek solutions of the wave equation in the form

    $u(x, t) = \phi(x)\psi(t)$

which satisfy the boundary conditions (10). Then we must have

    $\frac{\phi''(x)}{\phi(x)} = \frac{1}{c^2}\, \frac{\ddot{\psi}(t)}{\psi(t)} = k^2 = \text{constant},$

and

    $\phi(0) = \phi(L) = 0,$

where two primes indicate $d^2/dx^2$ and two dots indicate $d^2/dt^2$. The general
solutions of the differential equations which result from this separation
are

    $\phi(x) = \alpha e^{kx} + \beta e^{-kx},$
    $\psi(t) = a e^{ckt} + b e^{-ckt}.$

From the boundary condition $\phi(0) = 0$, we get $\alpha = -\beta$ while $\phi(L) = 0$
implies that

    $e^{kL} = e^{-kL}$ or $e^{2kL} = 1.$

Thus the boundary conditions can only be satisfied if $k$ has pure imaginary
values such that $2kL = 2n\pi i$, or

    $k = i\, \frac{n\pi}{L}, \qquad n = 1, 2, \ldots.$

We omit $n = 0$ since it leads to the trivial result $\phi(x) = 0$. With $\alpha = 1/(2i)$
and any coefficients $a$ and $b$, we have shown that $\phi(x)\psi(t)$ is a solution of
(1) satisfying (10), if

    $\phi(x) = \phi_n(x) = \sin \frac{n\pi x}{L},$

    $\psi(t) = \psi_n(t) = a_n \sin \frac{c n\pi t}{L} + b_n \cos \frac{c n\pi t}{L}, \qquad n = 1, 2, \ldots.$

Thus, formally, a solution of the wave equation which satisfies (10) is
given by

    $u(x, t) = \sum_{n=1}^{\infty} \phi_n(x)\psi_n(t).$

To satisfy the initial conditions (9) with this solution, we require that

(11a)  $\sum_{n=1}^{\infty} b_n \sin \frac{n\pi x}{L} = f(x), \qquad \sum_{n=1}^{\infty} a_n\, \frac{c n\pi}{L} \sin \frac{n\pi x}{L} = 0.$

Multiply each of these relations by $\sin(m\pi x/L)$ and integrate over $[0, L]$,
if the series converge uniformly, to find, since

    $\int_0^{\pi} \sin n\theta\, \sin m\theta\, d\theta = \frac{\pi}{2}\, \delta_{mn},$

(11b)  $b_n = \frac{2}{L} \int_0^L f(x) \sin \frac{n\pi x}{L}\, dx, \qquad a_n = 0, \qquad n = 1, 2, \ldots.$

The coefficients $b_n$ are just the Fourier coefficients for the expansion of
$f(x)$ in a sine series. If $f'(x)$ is piecewise continuous, this series converges
uniformly to $f(x)$. The solution of the mixed problem (1), (9), and (10)
is given by

(11c)  $u(x, t) = \sum_{n=1}^{\infty} b_n \sin \frac{n\pi x}{L} \cos \frac{c n\pi t}{L}$

       $= \sum_{n=1}^{\infty} \frac{b_n}{2} \left[ \sin \frac{n\pi}{L}(x + ct) + \sin \frac{n\pi}{L}(x - ct) \right].$


If the function $f(x)$ has a piecewise continuous third derivative, the
series in (11) defines a function $u(x, t)$ with continuous second derivatives
which can be evaluated by differentiating (11c) termwise. Hence, $u(x, t)$
defined by (11) is a solution of the mixed problem. Equation (11c) again
shows that $u(x, t)$ is the sum of functions of the two variables $x \pm ct$.
In fact, in this special case,

    $u(x, t) = P(x + ct) + P(x - ct),$

where

    $P(x) \equiv \sum_{n=1}^{\infty} \frac{b_n}{2} \sin \frac{n\pi x}{L} = \frac{f(x)}{2}.$
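
The series solution (11b) and (11c) can be evaluated numerically. The sketch below is an illustration only, under assumptions not in the text: the initial shape $f(x) = x(L - x)$ with $L = c = 1$, a composite Simpson rule for the coefficients, and a partial sum of 50 terms. The names `sine_coefficients` and `series_solution` are hypothetical.

```python
import math

def sine_coefficients(f, L, n_terms, m=2000):
    """b_n = (2/L) * integral_0^L f(x) sin(n pi x / L) dx, by Simpson's rule (m even)."""
    h = L / m
    coeffs = []
    for n in range(1, n_terms + 1):
        def integrand(x):
            return f(x) * math.sin(n * math.pi * x / L)
        s = integrand(0.0) + integrand(L)
        for i in range(1, m):
            s += (4 if i % 2 else 2) * integrand(i * h)
        coeffs.append(2.0 / L * s * h / 3.0)
    return coeffs

def series_solution(b, L, c, x, t):
    """Partial sum of (11c) using the coefficients b[0], b[1], ... = b_1, b_2, ..."""
    return sum(bn * math.sin(n * math.pi * x / L) * math.cos(c * n * math.pi * t / L)
               for n, bn in enumerate(b, start=1))

# Assumed data: f(x) = x (L - x), which vanishes at both ends of [0, L].
L, c = 1.0, 1.0
f = lambda x: x * (L - x)
b = sine_coefficients(f, L, n_terms=50)
u_start = series_solution(b, L, c, 0.3, 0.0)
u_period = series_solution(b, L, c, 0.3, 2.0 * L / c)
```

For this $f$ the coefficients are $b_n = 8L^2/(n^3\pi^3)$ for odd $n$ and zero for even $n$, and each term of (11c) is periodic in $t$ with period $2L/c$, so `u_period` agrees with `u_start`.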


3.1. Difference Approximations and Domains of Dependence 

On the half space $t > 0$, $|x| < \infty$, we introduce the uniformly spaced net
points

    $x_j = j\Delta x, \quad t_n = n\Delta t; \qquad |j|, n = 0, 1, 2, \ldots.$

The set of net points $D_\Delta$ is defined by

    $D_\Delta \equiv \{x_j, t_n \mid j = 0, \pm 1, \pm 2, \ldots;\ n = 1, 2, \ldots\}.$

A direct approximation of the wave equation (1) is obtained by using
centered difference quotients, as in Section 1, to replace derivatives. Thus
if $U(x, t)$ is a net function, we consider the difference equations

(12a)  $U_{t\bar{t}}(x, t) - c^2 U_{x\bar{x}}(x, t) = 0, \qquad (x, t) \in D_\Delta.$

If we take the point $(x, t) = (x_j, t_n)$ and use the subscript notation
$U(x_j, t_n) = U_{j,n}$, then (12a) can be multiplied by $\Delta t^2$ and the result
rewritten as

(12b)  $U_{j,n+1} = 2\left[ 1 - \left( c\, \frac{\Delta t}{\Delta x} \right)^2 \right] U_{j,n} + \left( c\, \frac{\Delta t}{\Delta x} \right)^2 (U_{j+1,n} + U_{j-1,n}) - U_{j,n-1},$

       $n \ge 1, \quad |j| = 0, 1, \ldots.$

The star of net points entering into (12) is the same as that in Figure 1
of Section 1 (with $y$ replaced by $t$). To calculate $U$ on any time line $t = t_{n+1}$,
say, the values of $U$ must be known on the two preceding time lines.
Thus in order to start the computations indicated in (12), we require
data on the two initial time lines $t = 0$ and $t = \Delta t$. This is consistent with
the form of initial data given for the wave equation in (2). A simple
adaptation of these conditions is

(13a)  $U(x, 0) = f(x),$

(13b)  $U_t(x, 0) \equiv U_{\bar{t}}(x, \Delta t) = g(x).$

From (13b), we have

(13c)  $U(x, \Delta t) = U(x, 0) + \Delta t\, U_t(x, 0) = f(x) + \Delta t\, g(x).$


More accurate approximations to $u(x, \Delta t)$ can be obtained if we assume
that $f$ and $g$ are sufficiently differentiable and that the wave equation (1)
is satisfied on the initial line, $t = 0$. That is, by Taylor's theorem

    $u(x, \Delta t) = u(x, 0) + \Delta t\, \frac{\partial u(x, 0)}{\partial t} + \frac{\Delta t^2}{2}\, \frac{\partial^2 u(x, 0)}{\partial t^2} + O(\Delta t^3).$

But since $u(x, t)$ satisfies (1) and (2),

    $\frac{\partial^2 u(x, 0)}{\partial t^2} = c^2\, \frac{\partial^2 u(x, 0)}{\partial x^2} = c^2 f''(x);$

hence

    $u(x, \Delta t) = f(x) + \Delta t\, g(x) + \frac{\Delta t^2}{2}\, c^2 f''(x) + O(\Delta t^3).$

This suggests replacing (13c) by the formula

(14a)  $U(x, \Delta t) = f(x) + \Delta t\, g(x) + \frac{\Delta t^2}{2}\, c^2 f_{x\bar{x}}(x),$

or equivalently the replacement of (13b) by

(14b)  $U_t(x, 0) \equiv U_{\bar{t}}(x, \Delta t) = g(x) + \frac{\Delta t}{2}\, c^2\, U_{x\bar{x}}(x, 0).$

Even more accurate approximations than (14a) can be derived by continuing
this procedure. For instance, the next term would involve

    $\frac{\partial^3 u(x, 0)}{\partial t^3} = c^2\, \frac{\partial^3 u(x, 0)}{\partial x^2\, \partial t} = c^2 g''(x).$

The difference problem posed by (12b) and (13a and c) [or (12b),
(13a), and (14a)] is called explicit since it is in a form in which the solution
is obtained recursively by evaluating the given formulae. (This was not the
case for the elliptic difference equations of Sections 1 and 2, where a major
part of the task was to solve the difference equations efficiently.) A glance
at (12), (14), and the star in Figure 1 on p. 447 indicates that the solution
at any fixed net point, $(x^*, t^*)$, depends only on the values of $U$ at the net
points in the triangle formed by the initial line and the two lines with
slopes $\pm\Delta t/\Delta x$, say $x \pm \frac{\Delta x}{\Delta t}\, t = \text{constant}$, which pass through $(x^*, t^*)$.
This region is shown in Figure 2 and it may be called the numerical
domain of dependence for the difference equations (12).
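
The explicit recursion can be sketched in a few lines. The code below is an illustration under assumptions that are not in the text: the net is taken periodic in $x$ on $[0, 1)$ so that a finite array suffices, the data are $f(x) = \sin 2\pi x$, $g = 0$, $c = 1$, and the second time line is started with (14a). The function name `wave_explicit` is hypothetical.

```python
import math

def wave_explicit(f, g, c, dx, dt, n_points, n_steps):
    """Scheme (12b) on a periodic net x_j = j*dx, started with (13a) and (14a)."""
    lam2 = (c * dt / dx) ** 2
    x = [j * dx for j in range(n_points)]
    u_prev = [f(xj) for xj in x]                       # (13a): time line t = 0
    u_curr = []
    for xj in x:                                       # (14a): time line t = dt
        fxx = (f(xj + dx) - 2.0 * f(xj) + f(xj - dx)) / dx ** 2
        u_curr.append(f(xj) + dt * g(xj) + 0.5 * dt ** 2 * c ** 2 * fxx)
    for _ in range(n_steps - 1):                       # (12b): later time lines
        u_next = [2.0 * (1.0 - lam2) * u_curr[j]
                  + lam2 * (u_curr[(j + 1) % n_points] + u_curr[j - 1])
                  - u_prev[j]
                  for j in range(n_points)]
        u_prev, u_curr = u_curr, u_next
    return x, u_curr                                   # values at t = n_steps * dt

# Assumed periodic test data; the exact solution is sin(2 pi x) cos(2 pi c t).
c = 1.0
f = lambda s: math.sin(2.0 * math.pi * s)
g = lambda s: 0.0
x, U = wave_explicit(f, g, c, dx=0.01, dt=0.005, n_points=100, n_steps=100)
err = max(abs(U[j] - f(x[j]) * math.cos(2.0 * math.pi * c * 0.5)) for j in range(100))
```

Each new value uses only the three neighbours on the previous time line and one value two lines back, which is the star of net points described above.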

Clearly, the numerical domain of dependence will be greater than or
equal to the domain of dependence of the wave equation, for the same
point $(x^*, t^*)$, iff

    $\frac{\Delta t}{\Delta x} \le \frac{1}{c}.$

We refer to $1/c$ as the characteristic slope and to $\Delta t/\Delta x$ as the net slope.
Therefore, if the characteristic slope is greater than or equal to the net slope,
then the numerical domain of dependence includes the domain of dependence
of the wave equation. We introduce the ratio of these slopes as

(15)  $\lambda \equiv \frac{\text{net slope}}{\text{characteristic slope}} = \frac{c\, \Delta t}{\Delta x}$

and then the above condition becomes $\lambda \le 1$. Note that since $c$ is the
speed of propagation of a signal or wave for the wave equation, $\lambda$ is the
ratio of the distance such a signal travels in one time step to the length
of a spatial step of the net. Thus if such signals cannot move more than
the distance $\Delta x$ in the time $\Delta t$, then the numerical domain contains the
analytical domain of dependence.

Figure 2. Net points and numerical domain of dependence for difference
scheme (12).

To understand the significance, for difference schemes, of these domains
of dependence, we consider two Cauchy problems for the wave equation.
The first is that posed by (1) and (2) with the solution $u(x, t)$ given in (4).
In the second problem we retain (2a) and replace $g(x)$ in (2b) by

    $g^*(x) = g(x) + \begin{cases} 0, & x < a, \\ 4cm(x - a), & x \ge a; \end{cases}$

where $x = a$ is an arbitrary fixed point. By using (4) with the new initial
data, the solution, $u^*(x, t)$, of the altered problem is found to be

    $u^*(x, t) = u(x, t) + \begin{cases} 0, & x + ct \le a, \\ \frac{1}{2c} \int_{\max(a,\, x - ct)}^{x + ct} 4cm(\xi - a)\, d\xi, & x + ct > a, \end{cases}$

(16a)
    $\phantom{u^*(x, t)} = u(x, t) + \begin{cases} 0, & x + ct \le a, \\ m(x + ct - a)^2, & x + ct \ge a \ge x - ct, \\ 4cmt(x - a), & x - ct \ge a. \end{cases}$

Now for each of these problems let us consider the corresponding difference
problem (12) and (13) for a net, chosen such that $x = a$ is a net point
on the initial line. If the difference solutions are denoted by $U$ and $U^*$
respectively, then, since they have identical initial data on $x \le a$, it
follows from a consideration of the numerical domains of dependence
that

(16b)  $U^*(x, t) = U(x, t), \qquad \text{if } x + \frac{\Delta x}{\Delta t}\, t \le a.$

If the net spacing is such that $\lambda > 1$, then there are net points $(x, t)$
which satisfy

    $x + \frac{\Delta x}{\Delta t}\, t = a \quad \text{and} \quad x + ct > a > x - ct.$

At such points we have from (16)

    $u^*(x, t) - u(x, t) = m(x + ct - a)^2,$

    $U^*(x, t) - U(x, t) = 0, \qquad a - ct < x = a - \frac{\Delta x}{\Delta t}\, t.$

If we let $\Delta x \to 0$ and $\Delta t \to 0$ while $\lambda = \text{constant} > 1$ and $x = a$ remains
a net point on $t = 0$, then clearly $U(x, t)$ and $U^*(x, t)$ cannot both converge
to the corresponding solutions $u(x, t)$ and $u^*(x, t)$. Thus we deduce

THEOREM 1. In general, the difference solution of (12) and (13) cannot
converge to the exact solution of (1) and (2) as $\Delta x \to 0$ and $\Delta t \to 0$ for
constant $\lambda = c\Delta t/\Delta x > 1$. ■

The requirement $\lambda \le 1$, which by the above observations is seen to
be a necessary condition for convergence in general (i.e., for "all" initial
value problems) is called the Courant-Friedrichs-Lewy condition (or sometimes
just the Courant condition for brevity). In other words, the numerical
domain of dependence of a difference scheme should include the domain
of dependence of the differential equation or else convergence is not always
possible. We also call this the domain of dependence condition.
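
The argument leading to Theorem 1 can be checked numerically. The following sketch is illustrative and rests on stated assumptions: since the scheme is linear, we march (12b) with the start (13c) directly on the difference $U^* - U$, whose data are $f^* - f = 0$ and $g^* - g$; we take $m = 1$, $a = 0$, $c = 1$, $\lambda = 2$, and a finite working window whose valid part shrinks by one net point per time line. The function name `perturbation_at_edge` is hypothetical.

```python
def perturbation_at_edge(c, dx, lam, n, m=1.0, a=0.0):
    """Return (U* - U, u* - u) at the net point x = a - n*dx, t = n*dt."""
    dt = lam * dx / c
    J = 2 * n                                    # half-width of the working window
    xs = [a + j * dx for j in range(-J, J + 1)]
    dg = [4.0 * c * m * max(x - a, 0.0) for x in xs]   # g* - g on the initial line
    prev = [0.0] * len(xs)                       # (13a) applied to f* - f = 0
    curr = [dt * d for d in dg]                  # (13c) applied to the difference
    for k in range(1, n):                        # advance from time line k to k + 1
        nxt = curr[:]
        for i in range(k + 1, len(xs) - k - 1):  # points whose stencil is still valid
            nxt[i] = (2.0 * (1.0 - lam ** 2) * curr[i]
                      + lam ** 2 * (curr[i + 1] + curr[i - 1]) - prev[i])
        prev, curr = curr, nxt
    i_star = J - n                               # index of x = a - n*dx
    x_star, t_star = xs[i_star], n * dt
    inside = x_star + c * t_star > a > x_star - c * t_star
    exact = m * (x_star + c * t_star - a) ** 2 if inside else 0.0   # from (16a)
    return curr[i_star], exact

# Refine the net with lambda = 2 fixed, keeping the point (x, t) = (-0.5, 1) on the net.
results = [perturbation_at_edge(c=1.0, dx=1.0 / (2 * n), lam=2.0, n=n) for n in (10, 20, 40)]
```

On every net the computed difference $U^* - U$ is exactly zero while $u^* - u = 0.25$, so the two difference solutions cannot both converge.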

The relationship between the notion of a domain of dependence and the
convergence of a difference method is easily studied for the initial value
problem of the single characteristic equation (8a), in which $a > 0$ is a
constant. Then the characteristic curves, determined by (8b), are the lines

    $x = at + \text{constant}.$

The solution of (8a) which satisfies the initial condition

    $w(x, 0) = f(x),$

is thus

    $w(x, t) = f(x - at).$

The domain of dependence of the point $(x^*, t^*)$ is the set of points $(x, t)$
on the characteristic, $x = at + x^* - at^*$.

With the uniform net of spacing $\Delta x$ and $\Delta t$ we consider the difference
equations, for a net function $W(x, t)$,

(17a)  $W_t(x, t) + aW_x(x, t) = 0.$

In subscript notation, with $(x, t) = (x_j, t_n)$ and $\lambda = a\Delta t/\Delta x$, this becomes

    $W_{j,n+1} = (1 + \lambda) W_{j,n} - \lambda W_{j+1,n}.$

For this scheme, $W(x, t + \Delta t)$ is determined by data to the right of $x$
or directly below it. However, since $a > 0$, the exact solution depends upon
data to the left of $x$. Thus for any $\lambda > 0$, the numerical domain of dependence
cannot contain that of the differential equation. Clearly then, this
procedure is not convergent, in general, as $\Delta x, \Delta t \to 0$.

Now, let us replace the forward $x$-difference by a backward difference
to get

(17b)  $W_t(x, t) + aW_{\bar{x}}(x, t) = 0,$

or equivalently

    $W_{j,n+1} = (1 - \lambda) W_{j,n} + \lambda W_{j-1,n}.$

Clearly, the actual domain of dependence is contained within the numerical
domain if $\lambda \le 1$. If the exact solution is sufficiently smooth (i.e., say
$f''(x)$ is continuous), then we find that the local truncation error, $\tau$, is

    $\tau(x, t) \equiv w_t(x, t) + a w_{\bar{x}}(x, t) = O(\Delta t + \Delta x).$

With the definition

    $e(x, t) \equiv W(x, t) - w(x, t)$

we obtain

    $e_{j,n+1} = (1 - \lambda) e_{j,n} + \lambda e_{j-1,n} - \Delta t\, \tau_{j,n}.$

Now let $E_n = \text{l.u.b.}_j\, |e_{j,n}|$ and take the absolute value of both sides to
get, since $\lambda \le 1$,

    $|e_{j,n+1}| \le (1 - \lambda)|e_{j,n}| + \lambda |e_{j-1,n}| + \Delta t\, |\tau_{j,n}|$

    $\le (1 - \lambda) E_n + \lambda E_n + \Delta t\, O(\Delta t + \Delta x)$

    $\le E_n + \Delta t\, O(\Delta t + \Delta x).$

Thus

    $E_{n+1} \le E_n + \Delta t\, O(\Delta t + \Delta x)$

and a simple recursion yields,

    $E_{n+1} \le E_0 + t_{n+1}\, O(\Delta t + \Delta x),$

or

    $|W(x, t) - w(x, t)| \le \|W(x, 0) - f(x)\|_\infty + t\, O(\Delta t + \Delta x).$

Convergence now follows as $\Delta t$ and $\Delta x$ vanish while $\lambda \le 1$, provided the
initial data $W(x, 0)$ approaches $f(x)$.
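
The behaviour of the two one-sided schemes can be observed on a small example. This is a sketch under assumptions not in the text: a periodic net on $[0, 1)$, the data $f(x) = \sin 2\pi x$, $a = 1$, $\lambda = 0.5$, and the final time $t = 0.5$. The function name `advect` is hypothetical.

```python
import math

def advect(f, a, lam, n_points, t_end, backward=True):
    """Max error at t_end of (17b) (backward=True) or (17a) on a periodic net on [0, 1)."""
    dx = 1.0 / n_points
    dt = lam * dx / a
    n_steps = round(t_end / dt)
    w = [f(j * dx) for j in range(n_points)]
    for _ in range(n_steps):
        if backward:   # W_{j,n+1} = (1 - lam) W_{j,n} + lam W_{j-1,n}
            w = [(1.0 - lam) * w[j] + lam * w[j - 1] for j in range(n_points)]
        else:          # W_{j,n+1} = (1 + lam) W_{j,n} - lam W_{j+1,n}
            w = [(1.0 + lam) * w[j] - lam * w[(j + 1) % n_points] for j in range(n_points)]
    t = n_steps * dt
    return max(abs(w[j] - f(j * dx - a * t)) for j in range(n_points))

f = lambda s: math.sin(2.0 * math.pi * s)
e_coarse = advect(f, a=1.0, lam=0.5, n_points=50, t_end=0.5)
e_fine = advect(f, a=1.0, lam=0.5, n_points=100, t_end=0.5)
e_forward = advect(f, a=1.0, lam=0.5, n_points=100, t_end=0.5, backward=False)
```

Halving the net spacing roughly halves the error of (17b), in line with the bound $t\, O(\Delta t + \Delta x)$, while the forward-difference scheme (17a) gives a larger error on the same net.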

The second scheme converges for special choices of the mesh ratio,
$\Delta t/\Delta x$, and is therefore said to be conditionally convergent. Of course, the
first scheme never satisfies the domain of dependence condition while
the second converges when it does satisfy this condition. However, there
are schemes which satisfy the domain of dependence condition, are reasonable
approximations to the differential equation but still do not converge for any
value of the mesh ratio. Consider, for example, the scheme

(17c)  $W_t(x, t) + aW_{\hat{x}}(x, t) = 0,$

which uses the centered $x$-difference quotient. The truncation error is
now $\tau = O(\Delta t + \Delta x^2)$, and is at least as good as in the previous case.
If $\lambda \le 1$, the domain of dependence condition is satisfied but this scheme
does not converge, in general, for any mesh ratio (see Problem 1). Thus to
determine convergent schemes it is not sufficient to examine domains of
dependence and truncation error alone. Now the difference scheme (17c)
can be modified in a simple way to yield a convergent scheme. We use

    $\frac{1}{\Delta t} \left\{ W(x, t + \Delta t) - \tfrac{1}{2}[W(x + \Delta x, t) + W(x - \Delta x, t)] \right\} + aW_{\hat{x}}(x, t) = 0,$

in which $W(x, t)$ in the forward difference $W_t(x, t)$ has been replaced by
an average of two adjacent values. In subscript notation this becomes

    $W_{j,n+1} = \tfrac{1}{2}(1 - \lambda) W_{j+1,n} + \tfrac{1}{2}(1 + \lambda) W_{j-1,n}$

and the truncation error is again $O(\Delta t + \Delta x^2)$. If $\lambda \le 1$, the numerical
domain of dependence includes that of the differential equation (8a);
the coefficients $\tfrac{1}{2}(1 \pm \lambda)$ are non-negative with sum unity; and convergence
can be proved as above [see also the convergence proof in (31)-(35)].
It should be noted that this difference scheme can be written as

    $W_t(x, t) + aW_{\hat{x}}(x, t) - \frac{a}{2\lambda}\, \Delta x\, W_{x\bar{x}}(x, t) = 0.$
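
One way to see the difference between (17c) and the averaged scheme is to follow a single Fourier mode $e^{ij\theta}$ through one time step, in the spirit of the hint to Problem 1. The sketch below is an illustration and not the authors' argument; the function name `amplification` and the choice $\lambda = 0.8$ are assumptions.

```python
import math

def amplification(lam, theta, averaged):
    """One-step growth factor of the mode exp(i*j*theta), for a > 0.

    averaged=False: scheme (17c), factor 1 - i*lam*sin(theta).
    averaged=True:  the modified scheme, factor cos(theta) - i*lam*sin(theta).
    """
    centre = math.cos(theta) if averaged else 1.0
    return complex(centre, -lam * math.sin(theta))

lam = 0.8
thetas = [k * math.pi / 50 for k in range(1, 50)]
worst_centered = max(abs(amplification(lam, th, False)) for th in thetas)
worst_averaged = max(abs(amplification(lam, th, True)) for th in thetas)
```

For (17c) some modes grow at every step by a factor as large as $\sqrt{1 + \lambda^2}$, whatever the mesh ratio, whereas for the averaged scheme no mode grows when $\lambda \le 1$.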


3.2. Convergence of Difference Solutions 

The difference solution determined by (12) and (13) converges to the
solution of the initial value problem (1) and (2), provided $\Delta x \to 0$ and
$\Delta t \to 0$ while $\lambda \le 1$. The proof of this fact is somewhat complicated for
$\lambda < 1$ but is much simpler if the special mesh ratio condition $\lambda = 1$
holds. Hence we first consider this case in which the characteristic slope
and net slope are equal. It follows that, with the definition

    $D_{j,n} \equiv U_{j,n} - U_{j-1,n-1},$

the difference equations (12) can be written, when $\lambda = 1$, as

    $D_{j,n+1} = D_{j+1,n}.$

Note that the value $U_{j,n}$ does not enter into the above difference equation.
In fact, the net points may be divided into two groups, corresponding to 
the red and black squares on a checkerboard, and the difference equations 
do not couple net points of different groups. Thus we need consider only
one such group of net points. Application of the above form of the difference
equation recursively yields

    $D_{j,n+1} = D_{j+1,n} = D_{j+2,n-1} = \cdots = D_{j+n,1}.$

These relations are equivalent to the fact that the $D_{l,m}$ are constant on the
diagonal $x + ct = \text{constant}$ through $(x_j, t_{n+1})$. We also observe by summing
the $D_{l,m}$ along the diagonal $x - ct = \text{constant}$ through $(x_j, t_{n+1})$
that

    $U_{j,n+1} - U_{j-n-1,0} = \sum_{\nu=0}^{n} D_{j-\nu,\, n-\nu+1}.$

By combining the last two results and recalling the initial conditions
(13a and c), we get

(18)  $U_{j,n+1} = U_{j-n-1,0} + \sum_{\nu=0}^{n} D_{j+n-2\nu,\, 1}$

      $= f_{j-n-1} + \Delta t \sum_{\nu=0}^{n} g_{j+n-2\nu} + \sum_{\nu=0}^{n} (f_{j+n-2\nu} - f_{j+n-2\nu-1}).$

This is an explicit representation of the solution of the difference
problem (12)-(13). To examine convergence we shall let $\Delta x = c\Delta t \to 0$.
Since $t_n = n\Delta t$ and $x_n = n\Delta x = ct_n$, it follows that $F_{j+n} = F(x_{j+n}) =
F(x_j + ct_n)$ for any function $F(x)$. Then if $f(x)$ has a continuous first
derivative

    $f_{j+n-2\nu} - f_{j+n-2\nu-1} = f(x_{j-2\nu} + ct_n) - f(x_{j-2\nu} + ct_n - \Delta x)$

    $= \Delta x\, f'(x_{j-2\nu} + ct_n + \theta_\nu \Delta x), \qquad 0 > \theta_\nu > -1.$

Now take the limit as $\Delta x \to 0$ and $n \to \infty$ in (18), while $t_{n+1} = t$ and
$x_j = x$, for any fixed $(x, t)$, to get

    $\lim_{\Delta x \to 0} U(x, t) = f(x - ct) + \lim_{\Delta x \to 0} \frac{1}{2c} \sum_{\nu=0}^{n} g(x + ct - [2\nu + 1]\Delta x)\, 2\Delta x$

    $\qquad + \lim_{\Delta x \to 0} \frac{1}{2} \sum_{\nu=0}^{n} f'(x + ct + \theta_\nu \Delta x - [2\nu + 1]\Delta x)\, 2\Delta x$

    $= f(x - ct) + \frac{1}{2c} \int_0^{2ct} g(x + ct - \eta)\, d\eta + \frac{1}{2} \int_0^{2ct} f'(x + ct - \eta)\, d\eta$

    $= \tfrac{1}{2}[f(x + ct) + f(x - ct)] + \frac{1}{2c} \int_{x - ct}^{x + ct} g(\eta)\, d\eta$

    $= u(x, t).$

Thus the proof of convergence, when $\lambda = 1$, has been completed by using
the representation of $u(x, t)$ given in (4).
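
The representation (18) can be compared with the recursion (12b) directly. The sketch below is illustrative; the data $f = \sin$, $g = \cos$, $c = 1.5$, $\Delta x = 0.01$ are assumptions, and the names `recursion_value` and `representation_18` are hypothetical. The recursion is carried out on the shrinking set of net points that can influence $U_{j,n+1}$.

```python
import math

def recursion_value(f, g, c, dx, j, n_plus_1):
    """U_{j,n+1} from (12b) with lambda = 1, started by (13a) and (13c)."""
    dt = dx / c
    N = n_plus_1
    lo, hi = j - N, j + N                        # net indices that can influence U_{j,N}
    prev = {k: f(k * dx) for k in range(lo, hi + 1)}
    curr = {k: f(k * dx) + dt * g(k * dx) for k in range(lo, hi + 1)}
    for level in range(1, N):                    # time line `level` -> `level + 1`
        nxt = {k: curr[k + 1] + curr[k - 1] - prev[k]
               for k in range(lo + level, hi - level + 1)}
        prev, curr = curr, nxt
    return curr[j]

def representation_18(f, g, c, dx, j, n_plus_1):
    """U_{j,n+1} from the explicit sum (18)."""
    n = n_plus_1 - 1
    dt = dx / c
    total = f((j - n - 1) * dx)
    for v in range(n + 1):
        k = j + n - 2 * v
        total += dt * g(k * dx) + f(k * dx) - f((k - 1) * dx)
    return total

# Assumed data; by (4) the exact solution is sin(x) cos(ct) + cos(x) sin(ct) / c.
f, g, c, dx = math.sin, math.cos, 1.5, 0.01
by_recursion = recursion_value(f, g, c, dx, j=3, n_plus_1=40)
by_formula = representation_18(f, g, c, dx, j=3, n_plus_1=40)
x, t = 3 * dx, 40 * dx / c
exact = math.sin(x) * math.cos(c * t) + math.cos(x) * math.sin(c * t) / c
```

The two evaluations agree to rounding error, and both differ from $u(x, t)$ only by a small amount that decreases as $\Delta x \to 0$.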

We study the case of $\lambda < 1$, first for the mixed initial-boundary value
problem formulated in (1), (9), and (10) whose solution is given in (11).
The difference problem can be formulated as the difference equations
(12) where now $D_\Delta$ are the net points in $0 < x < L$, $t > 0$, and the mesh
is such that, say,

    $(J + 1)\Delta x = L.$

The initial conditions (13) or (14) are to be satisfied for $0 < x < L$,
where $g(x) = 0$, and the boundary conditions for the difference problem
are

(19)  $U(0, t_n) = U_{0,n} = 0, \qquad U(L, t_n) = U_{J+1,n} = 0, \qquad n = 1, 2, \ldots.$

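We now seek solutions of the difference problem by the method of
separation of variables. In fact, from the experience gained in Subsection
1.1, we try the forms

    $U(x, t) = \Psi^{(p)}(t) \sin\left( p\, \frac{\pi x}{L} \right), \qquad p = 1, 2, \ldots.$

From (12)

    $\Psi^{(p)}_{t\bar{t}}(t) \sin\left( p\, \frac{\pi x}{L} \right) = c^2\, \Psi^{(p)}(t) \left[ \sin\left( p\, \frac{\pi x}{L} \right) \right]_{x\bar{x}} = -\frac{c^2 \xi_p^2}{\Delta x^2}\, \Psi^{(p)}(t) \sin\left( p\, \frac{\pi x}{L} \right),$

where $\xi_p \equiv 2 \sin\left( p\, \frac{\pi \Delta x}{2L} \right)$ and we have employed the trigonometric
identities

    $\sin\left[ p\, \frac{\pi(x + \Delta x)}{L} \right] + \sin\left[ p\, \frac{\pi(x - \Delta x)}{L} \right] = 2 \cos\left( p\, \frac{\pi \Delta x}{L} \right) \sin\left( p\, \frac{\pi x}{L} \right),$

    $1 - \cos\left( p\, \frac{\pi \Delta x}{L} \right) = 2 \sin^2\left( p\, \frac{\pi \Delta x}{2L} \right).$

It follows that $\Psi^{(p)}(t)$ must satisfy

    $\Delta t^2\, \Psi^{(p)}_{t\bar{t}}(t) = -\lambda^2 \xi_p^2\, \Psi^{(p)}(t),$

and we try the quantities

    $\Psi^{(p)}(t) = \sin \mu_p t.$

By using the same trigonometric identities, we get

    $4 \sin^2\left( \frac{\mu_p \Delta t}{2} \right) \sin \mu_p t = \lambda^2 \xi_p^2 \sin \mu_p t.$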

A similar result is obtained if we use $\Psi^{(p)}(t) = \cos \mu_p t$. Thus the terms

(20)  $(A_p \sin \mu_p t + B_p \cos \mu_p t) \sin\left( p\, \frac{\pi x}{L} \right), \qquad p = 1, 2, \ldots,$

will satisfy the difference equations (12) if $\mu_p$ is such that

    $4 \sin^2\left( \frac{\mu_p \Delta t}{2} \right) = \lambda^2 \xi_p^2,$

or

(21)  $\sin \frac{\mu_p \Delta t}{2} = \pm\lambda \sin\left( p\, \frac{\pi \Delta x}{2L} \right), \qquad p = 1, 2, \ldots.$

These transcendental equations have real roots, $\mu_p$, for all $p$, iff
$\lambda = c\Delta t/\Delta x \le 1$.
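
Equation (21) is easy to examine numerically. The following sketch is an illustration with assumed values $L = c = 1$ and $\Delta x = 0.01$; it takes the root with the plus sign and the principal value of the inverse sine. The function name `discrete_frequency` is hypothetical.

```python
import math

def discrete_frequency(p, c, L, dx, dt):
    """Root mu_p of (21) with the + sign, or None when (21) has no real root."""
    lam = c * dt / dx
    s = lam * math.sin(p * math.pi * dx / (2.0 * L))
    if abs(s) > 1.0:
        return None
    return 2.0 * math.asin(s) / dt

# lambda = 0.5: mu_1 is close to, but not equal to, the exact frequency c*pi/L.
mu_half = discrete_frequency(1, 1.0, 1.0, 0.01, 0.005)
# lambda = 1: the discrete and exact frequencies coincide for these p.
mu_unit = [discrete_frequency(p, 1.0, 1.0, 0.01, 0.01) for p in (1, 10, 50)]
# lambda = 1.25: the highest modes have no real root.
mu_none = discrete_frequency(99, 1.0, 1.0, 0.01, 0.0125)
```

With $\lambda > 1$ the right side of (21) exceeds one in magnitude for the higher modes, so the corresponding $\mu_p$ are complex and the terms (20) grow with $t$.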

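A linear combination of the solutions in (20) yields

(22)  $U(x, t) = \sum_{p=1}^{\infty} (A_p \sin \mu_p t + B_p \cos \mu_p t) \sin\left( p\, \frac{\pi x}{L} \right).$

If this series converges, it is a solution of (12) and satisfies the boundary
conditions (19). To satisfy the initial condition (13a) we must have

(23a)  $\sum_{p=1}^{\infty} B_p \sin\left( p\, \frac{\pi x}{L} \right) = f(x);$

while condition (14a) requires

(23b)  $\sum_{p=1}^{\infty} (A_p \sin \mu_p \Delta t + B_p \cos \mu_p \Delta t) \sin\left( p\, \frac{\pi x}{L} \right) = f(x) + \frac{\Delta t^2}{2}\, c^2 f_{x\bar{x}}(x)$

       $= (1 - \lambda^2) f(x) + \frac{\lambda^2}{2} [f(x + \Delta x) + f(x - \Delta x)].$

From (11a) we see that (23a) is satisfied if $B_p = b_p$, $p = 1, 2, \ldots$, where the
$b_p$ are defined in (11b). From the identity $1 - 2 \sin^2(\theta/2) = \cos \theta$ and
(21), we have

    $\cos(\mu_p \Delta t) \sin\left( p\, \frac{\pi x}{L} \right) = (1 - \lambda^2) \sin\left( p\, \frac{\pi x}{L} \right) + \frac{\lambda^2}{2} \left\{ \sin\left[ p\, \frac{\pi(x + \Delta x)}{L} \right] + \sin\left[ p\, \frac{\pi(x - \Delta x)}{L} \right] \right\};$

hence (23b) and (23a) yield

    $\sum_{p=1}^{\infty} A_p \sin(\mu_p \Delta t) \sin\left( p\, \frac{\pi x}{L} \right) = 0.$

Clearly, this is satisfied by the choice $A_p = 0$. Thus (22) is the solution
of the difference problem (12), (13a), (14), and (19), if $A_p = 0$ and
$B_p = b_p$. The series (22) converges if (11a) converges absolutely since the
$\mu_p$ are real (e.g., if $f'(x)$ is continuous).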

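With the exact solution given by (11c), we obtain

(24)  $|U(x, t) - u(x, t)| \le \left| \sum_{p=1}^{N} b_p \left[ \cos \mu_p t - \cos\left( p\, \frac{c\pi t}{L} \right) \right] \sin\left( p\, \frac{\pi x}{L} \right) \right|$

      $+ \left| \sum_{p=N+1}^{\infty} b_p \cos \mu_p t\, \sin\left( p\, \frac{\pi x}{L} \right) \right| + \left| \sum_{p=N+1}^{\infty} b_p \cos\left( p\, \frac{c\pi t}{L} \right) \sin\left( p\, \frac{\pi x}{L} \right) \right|.$

By taking $N$ sufficiently large, the last two sums can be made arbitrarily
small for all $\Delta x$, $\Delta t$ [since the corresponding series converge absolutely
if $f'(x)$ is continuous]. If $\Delta x \to 0$ and $\Delta t \to 0$ while $\lambda = c\Delta t/\Delta x < 1$
and $(J + 1)\Delta x = L$, we have from (21) for $1 \le p \le N$,

    $\lim_{\Delta t \to 0} \mu_p = p\, \frac{c\pi}{L}.$

Thus $|U(x, t) - u(x, t)|$ can be made arbitrarily small. This proves
convergence of the difference scheme, if $\lambda < 1$, for the mixed initial-boundary
value problem, when $f(x)$ has two continuous derivatives and
$g(x) = 0$. Convergence for the case $g(x) \ne 0$ can be shown in a similar
way.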

Now if $\lambda < 1$, convergence can be proved for the pure initial value
problem, by making use of the notion of domain of dependence. That is,
given an interval $[a, b]$ and time $T$, convergence in $S \equiv \{(x, t) \mid a < x < b,\
0 < t < T\}$, can be shown by modifying the initial data only for $x < a -
(\Delta x/\Delta t)T$, and $x > b + (\Delta x/\Delta t)T$. For this modified problem, the solutions
of the differential equation and of the difference equations are unchanged
in $S$. In fact, if the initial data have been modified so as to be periodic
and odd about $[a - (\Delta x/\Delta t)T - \delta,\ b + (\Delta x/\Delta t)T + \delta]$, for some $\delta > 0$,
then the above proof establishes convergence.

The proof of convergence does not generalize to equations with variable
coefficients; furthermore, it does not provide an estimate of the error in
terms of the interval size $\Delta x$ or $\Delta t$; neither does it provide a treatment of
the effect of rounding errors. These defects are avoided in the analysis
of the next subsection for the case of a first order system of equations.

3.3. Difference Methods for a First Order Hyperbolic System 

We have seen in equations (5) through (7) that the initial value problem 
for the wave equation can be replaced by an equivalent first order system. 
In fact, such systems arise more naturally in physical theories and applied
mathematics than does the second order wave equation. We shall treat
the simple system (5) which can be written in vector form as

(25a)  $\frac{\partial \mathbf{u}}{\partial t} = cA\, \frac{\partial \mathbf{u}}{\partial x},$

where

(25b)  $\mathbf{u} \equiv \begin{pmatrix} u \\ v \end{pmatrix}, \qquad A \equiv \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$

Most of our methods and results are also applicable to more general
hyperbolic systems of first order partial differential equations in two
independent variables. [For example, such systems may be formulated as
in (25a) with the square matrix $A$ of order $n$ having real eigenvalues and
simple elementary divisors, i.e., $A$ is diagonalizable.] The initial conditions
to be imposed can be written as

(26a)  $\mathbf{u}(x, 0) = \mathbf{u}_0(x),$

where for the problem posed in (5) and (6) we take

(26b)  $\mathbf{u}_0(x) \equiv \begin{pmatrix} f(x) \\ G(x) \end{pmatrix}.$

The solution of (25) subject to (26) is given in (7).

On the uniform net with spacing $\Delta x$ and $\Delta t$ we introduce the net functions
$U(x, t)$ and $V(x, t)$, or in vector form the vector net function

    $\mathbf{U}(x, t) \equiv \begin{pmatrix} U(x, t) \\ V(x, t) \end{pmatrix}.$

Then as an approximation to the system (25) with $\lambda = c\Delta t/\Delta x$, we consider

(27a)  $\mathbf{U}(x, t + \Delta t) = \tfrac{1}{2}[\mathbf{U}(x + \Delta x, t) + \mathbf{U}(x - \Delta x, t)] + \frac{\lambda}{2}\, A[\mathbf{U}(x + \Delta x, t) - \mathbf{U}(x - \Delta x, t)].$

[It would be tempting to replace the first bracketed term on the right-hand
side above by $\mathbf{U}(x, t)$, but as shown in Problem 1 that scheme is divergent.]
If we subtract $\mathbf{U}(x, t)$ from each side and divide by $\Delta t$, we can write (27)
in the difference quotient notation

(27b)  $\mathbf{U}_t(x, t) = cA\, \mathbf{U}_{\hat{x}}(x, t) + \frac{c}{2\lambda}\, \Delta x\, \mathbf{U}_{x\bar{x}}(x, t).$

Thus our difference equations are obtained by adding a term of order $\Delta x$
to the divergent approximation of (25). The discussion following equation
(17c) may be considered a motivation for (27a).

An immediate advantage of the scheme in (27) can be seen by considering
the case $\lambda = 1$ in which the net diagonals and characteristics have the
same slopes and so the numerical and analytical domains of dependence
coincide. By writing the system in component form and using $\Delta x = c\Delta t$,
we get from (27a)

      $U(x, t + \Delta t) = \tfrac{1}{2}[U(x + c\Delta t, t) + U(x - c\Delta t, t)] + \tfrac{1}{2}[V(x + c\Delta t, t) - V(x - c\Delta t, t)],$
(28)
      $V(x, t + \Delta t) = \tfrac{1}{2}[U(x + c\Delta t, t) - U(x - c\Delta t, t)] + \tfrac{1}{2}[V(x + c\Delta t, t) + V(x - c\Delta t, t)].$

For the initial conditions

(29)  $U(x, 0) = f(x), \qquad V(x, 0) = G(x);$

a comparison of (28) when $t = 0$ with (7) when $t = \Delta t$ shows that the
numerical solution and the exact solution are identical on net points for
which $t = \Delta t$. By considering the exact solution at this first time step as
initial data, we find that the solution is also exact for net points with
$t = 2\Delta t$. By induction, we can show that the difference scheme (27) with
$\lambda = 1$ subject to the initial data (29) has a solution which is equal to the
exact solution (7) of (25) and (26) at the points of the net. (However, for
higher order systems and variable coefficients, we do not get the exact
solution for any fixed choice of $\lambda$. These results suggest the use of the
largest value of $\lambda$ for which the domain of dependence condition is
satisfied.)

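Let us consider the scheme (27) with $\lambda$ arbitrary. From the considerations
of domains of dependence we know that for $\lambda > 1$, the difference
solution cannot generally converge to the exact solution. Therefore, we
restrict the mesh ratio by $0 < \lambda \le 1$ and proceed to show that the
approximate solution then converges to the exact solution (which is
assumed sufficiently smooth). By using the exact solution $\mathbf{u}(x, t)$ of (25)
at the points of the net, we define the local truncation error $\boldsymbol{\tau}(x, t)$,

(30)  $\boldsymbol{\tau}(x, t) \equiv \mathbf{u}_t(x, t) - cA\, \mathbf{u}_{\hat{x}}(x, t) - \frac{c}{2\lambda}\, \Delta x\, \mathbf{u}_{x\bar{x}}(x, t)$

      $= \left( \frac{\partial \mathbf{u}}{\partial t} - cA\, \frac{\partial \mathbf{u}}{\partial x} \right) + O(\Delta t) + O(\Delta x^2) - \frac{c}{2\lambda}\, \Delta x \left[ \frac{\partial^2 \mathbf{u}}{\partial x^2} + O(\Delta x^2) \right]$

      $= O(\Delta t + \Delta x).$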
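
If we define

    $\mathbf{e}(x, t) \equiv \mathbf{U}(x, t) - \mathbf{u}(x, t),$

then from (27b) and (30) it follows that

    $\mathbf{e}_t(x, t) = cA\, \mathbf{e}_{\hat{x}}(x, t) + \frac{c}{2\lambda}\, \Delta x\, \mathbf{e}_{x\bar{x}}(x, t) - \boldsymbol{\tau}(x, t),$

or as in (27a) this is equivalent to

(31)  $\mathbf{e}(x, t + \Delta t) = \tfrac{1}{2}(I + \lambda A)\, \mathbf{e}(x + \Delta x, t) + \tfrac{1}{2}(I - \lambda A)\, \mathbf{e}(x - \Delta x, t) - \Delta t\, \boldsymbol{\tau}(x, t).$

Here we have introduced the identity matrix $I$.

Since $A$ is symmetric, it can be diagonalized by an orthogonal matrix.
We have, in fact,

(32)  $PAP^* = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \qquad P \equiv \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}.$

Now let us introduce the vector net function

(33)  $\boldsymbol{\epsilon}(x, t) \equiv \begin{pmatrix} \xi(x, t) \\ \eta(x, t) \end{pmatrix} \equiv P\, \mathbf{e}(x, t)$

and multiply (31) by $P$ on the left to get

    $\boldsymbol{\epsilon}(x, t + \Delta t) = \tfrac{1}{2}(I + \lambda PAP^*)\, \boldsymbol{\epsilon}(x + \Delta x, t) + \tfrac{1}{2}(I - \lambda PAP^*)\, \boldsymbol{\epsilon}(x - \Delta x, t) - \Delta t\, P\boldsymbol{\tau}(x, t).$

By taking absolute values and using (32), we find in component form

    $|\xi(x, t + \Delta t)| \le \tfrac{1}{2}|1 + \lambda| \cdot |\xi(x + \Delta x, t)| + \tfrac{1}{2}|1 - \lambda| \cdot |\xi(x - \Delta x, t)| + \frac{\Delta t}{\sqrt{2}}\, |\tau_1(x, t) + \tau_2(x, t)|,$

    $|\eta(x, t + \Delta t)| \le \tfrac{1}{2}|1 - \lambda| \cdot |\eta(x + \Delta x, t)| + \tfrac{1}{2}|1 + \lambda| \cdot |\eta(x - \Delta x, t)| + \frac{\Delta t}{\sqrt{2}}\, |\tau_1(x, t) - \tau_2(x, t)|.$

Since $0 < \lambda \le 1$, the absolute value signs can be removed from the
factors $|1 \pm \lambda|$. Then with the definitions

(34)  $E(t) \equiv \sup_x \|\boldsymbol{\epsilon}(x, t)\|, \qquad \sigma(t) \equiv \sup_x \|\boldsymbol{\tau}(x, t)\|,$

where the norm of a vector is the maximum absolute component, we
deduce

    $E(t + \Delta t) \le E(t) + \sqrt{2}\, \Delta t\, \sigma(t).$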

A recursive application of this inequality yields

(35)  $E(t) \le E(0) + \sqrt{2}\, \Delta t \sum_{\nu=1}^{t/\Delta t} \sigma(t - \nu\Delta t)$

      $\le E(0) + \sqrt{2}\, t\, \|\sigma(t)\|,$

where

    $\|\sigma(t)\| \equiv \sup_{t' < t} \sigma(t') = \sup_{x,\ t' < t} \|\boldsymbol{\tau}(x, t')\|.$

Now we recall that, from (32) and (33), $\mathbf{e}(x, t) = P\boldsymbol{\epsilon}(x, t)$ and so

    $\|\mathbf{e}(x, t)\| \le \|P\| \cdot \|\boldsymbol{\epsilon}(x, t)\| \le \sqrt{2}\, \|\boldsymbol{\epsilon}(x, t)\| \le \sqrt{2}\, E(t).$

Thus it follows from (35) and the definitions of $\mathbf{e}$ and $E$ that

(36)  $\|\mathbf{U}(x, t) - \mathbf{u}(x, t)\| \le \sqrt{2}\, \sup_x \|\mathbf{U}(x, 0) - \mathbf{u}(x, 0)\| + 2t\, \|\sigma(t)\|.$

Note that the suprema on the right side need be taken only over points
in the domain of dependence of the point $(x, t)$. By using the initial
data (29) and the estimate (30) of the local truncation error, the above
implies

(37)  $\|\mathbf{U}(x, t) - \mathbf{u}(x, t)\| \le t\, O(\Delta t + \Delta x).$

Thus, as was to be shown, the difference solution of (27) and (29) converges
to the exact solution of (25) and (26) as $\Delta t \to 0$ and $\Delta x \to 0$ for $\lambda = c\Delta t/\Delta x
\le 1$. The convergence here is at least first order in $\Delta t$ or $\Delta x$. In Problem 2,
the numerical scheme (27c) is shown to be convergent if the rounding
error is of the same order as the truncation error.

We remark that the scheme (27) is convergent for hyperbolic systems
of order $n$ in the form (25a), where $A$ is diagonalizable (see Problems
3 and 4).
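
Both properties of scheme (27), exactness at the net points when $\lambda = 1$ and first order convergence when $\lambda < 1$, can be observed in a short computation. The sketch below is an illustration under assumptions not in the text: a periodic net on $[0, 1)$, $c = 1$, and the data $f(x) = \sin 2\pi x$, $G(x) = \tfrac{1}{2}\cos 2\pi x$. The names `lax_friedrichs_system` and `max_error` are hypothetical.

```python
import math

def lax_friedrichs_system(f, G, lam, n_points, n_steps):
    """March (27a), in the component form of (28), on a periodic net on [0, 1)."""
    dx = 1.0 / n_points
    U = [f(j * dx) for j in range(n_points)]       # initial data (29)
    V = [G(j * dx) for j in range(n_points)]
    for _ in range(n_steps):
        Un, Vn = [], []
        for j in range(n_points):
            jp, jm = (j + 1) % n_points, j - 1     # periodic neighbours
            Un.append(0.5 * (U[jp] + U[jm]) + 0.5 * lam * (V[jp] - V[jm]))
            Vn.append(0.5 * (V[jp] + V[jm]) + 0.5 * lam * (U[jp] - U[jm]))
        U, V = Un, Vn
    return U, V

def max_error(f, G, c, lam, n_points, n_steps):
    """Largest deviation from the exact solution (7) on the final time line."""
    dx = 1.0 / n_points
    t = n_steps * lam * dx / c
    U, V = lax_friedrichs_system(f, G, lam, n_points, n_steps)
    err = 0.0
    for j in range(n_points):
        x = j * dx
        u = 0.5 * (f(x + c * t) + f(x - c * t)) + 0.5 * (G(x + c * t) - G(x - c * t))
        v = 0.5 * (f(x + c * t) - f(x - c * t)) + 0.5 * (G(x + c * t) + G(x - c * t))
        err = max(err, abs(U[j] - u), abs(V[j] - v))
    return err

f = lambda s: math.sin(2.0 * math.pi * s)
G = lambda s: 0.5 * math.cos(2.0 * math.pi * s)
exact_case = max_error(f, G, c=1.0, lam=1.0, n_points=64, n_steps=40)
coarse = max_error(f, G, c=1.0, lam=0.5, n_points=100, n_steps=50)
fine = max_error(f, G, c=1.0, lam=0.5, n_points=200, n_steps=100)
```

With $\lambda = 1$ the error is at rounding level; with $\lambda = 0.5$ the error at the fixed time $t = 0.25$ is roughly halved when $\Delta x$ and $\Delta t$ are halved, as (37) suggests.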


PROBLEMS, SECTION 3 

1. Show that the difference scheme $W_t = W_{\hat{x}}$ with constant $\lambda = \Delta t/\Delta x$,
is divergent as an approximation to $\partial w/\partial t = \partial w/\partial x$.

[Hint: Show that $e^{\beta t} e^{i\alpha x}$ is a solution of the difference equation, if $i^2 = -1$
and

    $e^{\beta \Delta t} = 1 + i\lambda \sin \alpha\Delta x.$

Consider now the initial data

    $W(x, 0) = \sum_{r=0}^{\infty} 2^{-2r} \cos \frac{\pi}{2}\, 2^r x.$

Show that $W(0, t) \to \infty$ if $\Delta x = 2^{-n}$ and $n \to \infty$. That is, set $\alpha_r = \pi 2^{r-1}$ and

    $W(x, t) = \mathrm{Re} \sum_{r=0}^{\infty} 2^{-2r} e^{\beta_r t} e^{i\alpha_r x}.$

Show that the term $r = n$ dominates the sum of all of the other terms as
$n \to \infty$, in

    $W(0, t) \ge -\sum_{r=n+1}^{\infty} 2^{-2r} + \mathrm{Re} \sum_{r=0}^{n} 2^{-2r} \left( 1 + i\lambda \sin \frac{\pi}{2}\, 2^{r-n} \right)^{t/\Delta t}.$]

2. Show that the difference scheme 

(27c) U« = c/lU* + ^ + P(*> 0. 

V(x, 0) = u(x, 0) + p(x) 
converges with error 

||U(x, /) - u(x, OH < t(9(At + Ax), 

if the rounding errors p(x, /) and p(x) are at most of magnitude 0(At + Ax). 

3. Carry out the proof of convergence of scheme (27) for the case of a 
system of n equations (25a), where A is a constant matrix having a complete 
set of eigenvectors. 

4. If $A = A(x,t)$ has a uniformly bounded matrix of real eigenvectors
$P(x,t)$, with uniformly bounded inverse $P^{-1}(x,t)$, then show that (27) is a
convergent scheme for (25a).

5. Given the difference scheme $W_t = W_{\hat x}$ (shown in Problem 1 to be
divergent if $\lambda = \Delta t/\Delta x$ is constant), prove convergence if
$\lambda = \mu\Delta x$ with some constant $\mu$, for the periodic initial value
problems

    $w(x,0) = f(x) = \sum_{n=-\infty}^{\infty} a_n e^{inx}$

such that $\sum_{n=-\infty}^{\infty} |n\,a_n| < \infty$.

[Hint: Verify that the function

    $W(x,t) = \sum_{n=-\infty}^{\infty} a_n \bigl(1 + i\lambda \sin n\Delta x\bigr)^{t/\Delta t} e^{inx}$

is defined by the series, satisfies the difference equation, and converges to
$f(x + t)$ as $\Delta x \to 0$, if $\Delta t/\Delta x = \mu\Delta x$. That is, show by
using Lemma 0.1' of Chapter 8,

    $|1 + i\lambda \sin n\Delta x|^{t/\Delta t} \le (1 + \mu^2\Delta x^2 \sin^2 n\Delta x)^{t/(2\Delta t)}$
    $\le e^{(\mu t/2)\sin^2 n\Delta x}$
    $\le e^{\mu t/2}.$]

Such a difference scheme is rather inefficient, since too many time steps are 
required. 
4. HEAT EQUATION

The initial value problem for the heat equation is: Find a continuous
function $u(x,t)$ that satisfies

(1a)    $\frac{\partial u}{\partial t} - \frac{\partial^2 u}{\partial x^2} = 0, \quad t > 0;$

(1b)    $u(x,0) = f(x), \quad -\infty < x < \infty.$

The solution of this problem is found to be

(2)    $u(x,t) = \int_{-\infty}^{\infty} f(\xi)\,\frac{e^{-(\xi - x)^2/4t}}{\sqrt{4\pi t}}\,d\xi.$

Here we assume $f(x)$ to be bounded and continuous, and then direct
differentiation under the integral sign shows that (1a) is satisfied. Since

    $\int_{-\infty}^{\infty} e^{-y^2}\,dy = \sqrt{\pi}$

we may write (2) as

    $u(x,t) = f(x) + \int_{-\infty}^{\infty} [f(\xi) - f(x)]\,\frac{e^{-(\xi - x)^2/4t}}{\sqrt{4\pi t}}\,d\xi$

and now let $t \to 0$ from above. For all $\xi \ne x$ we have

    $\lim_{t \to 0} \frac{e^{-(\xi - x)^2/4t}}{\sqrt{4\pi t}} = 0$

and for $\xi = x$, the remaining factor in the integrand vanishes. Thus it is
plausible that we could prove that the function given by (2) is continuous
and satisfies the initial condition (1b).

Now from (2), we see that if $f(x) > 0$ in an open interval $(a, b)$ and
$f(x) = 0$ outside $(a, b)$, then $u(x,t) > 0$ for all $x$ when $t > 0$. Thus we
may say that signals propagate with infinite speed for the heat equation.
Clearly, the form of the solution in (2) shows that the domain of dependence
of a point $(x,t)$ with $t > 0$ is the entire $x$-axis (or initial line).
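The infinite signal speed can be observed numerically. The following is a minimal sketch, not part of the original text: it approximates the integral (2) by a plain Riemann sum for a hat-shaped initial function supported on $(0, 1)$. The names `heat_solution` and `hat`, the truncation width, and the number of quadrature points are all illustrative assumptions.

```python
import numpy as np

def heat_solution(f, x, t, half_width=20.0, n=20001):
    """Approximate u(x, t) of formula (2) by a Riemann sum over
    [x - half_width, x + half_width] (truncation is an assumption)."""
    xi = np.linspace(x - half_width, x + half_width, n)
    kernel = np.exp(-(xi - x) ** 2 / (4.0 * t)) / np.sqrt(4.0 * np.pi * t)
    d_xi = xi[1] - xi[0]
    return float(np.sum(f(xi) * kernel) * d_xi)

def hat(xi):
    """Continuous initial data, positive only on the interval (0, 1)."""
    return np.maximum(0.0, 1.0 - np.abs(2.0 * xi - 1.0))

# A point well outside the support of f already feels the data at t = 0.1.
u_far = heat_solution(hat, 3.0, 0.1)
u_mid = heat_solution(hat, 0.5, 0.1)
print(u_far > 0.0, u_mid)
```

The value at $x = 3$ is tiny but strictly positive, in line with the statement that the domain of dependence is the whole initial line.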

With the uniform net spacings $\Delta x$, $\Delta t$, and net points $D_\Delta$ in the
half-space $t > 0$ (see Subsection 3.1), we consider the difference equations

(3a)    $U_t(x,t) - U_{x\bar x}(x,t) = 0, \quad (x,t) \in D_\Delta.$

In subscript notation with $(x,t) = (x_j, t_n)$, this can be written in the form

(3b)    $U_{j,n+1} = (1 - 2\lambda)U_{j,n} + \lambda(U_{j+1,n} + U_{j-1,n}), \quad n = 0, 1, \ldots.$

Here we have introduced the mesh ratio

(3c)    $\lambda \equiv \frac{\Delta t}{\Delta x^2}.$

The net function $U(x,t)$ is also subject to initial conditions which, from
(1b), we take as

(4)    $U(x,0) = f(x).$


The net points used in the difference equation (3) have the star or
stencil of Figure 1. The solution is easily evaluated by means of (3b) and
we see that the numerical interval of dependence of a point $(x,t)$ in the net
is the initial line segment $[x - (\Delta x/\Delta t)t,\; x + (\Delta x/\Delta t)t]$. Thus
in order to satisfy the domain of dependence condition of Subsection 3.1,
which is again valid, we must have that $\Delta t/\Delta x \to 0$ as $\Delta t \to 0$ and
$\Delta x \to 0$. Otherwise, the numerical interval of dependence of the difference
equation (3) would not become arbitrarily large, and hence convergence could
not occur in all cases.

If, as the net spacing goes to zero, the mesh ratio $\lambda$ defined in (3c) is
constant, then $\Delta t/\Delta x = \lambda\Delta x \to 0$ and the domain of dependence
condition is satisfied. We shall show, in fact, that if $0 < \lambda \le \frac{1}{2}$ then
the difference scheme (3) and (4) is convergent; but if $\lambda > \frac{1}{2}$ the
difference solution does not generally converge to the exact solution. As
usual, the truncation error $\tau(x,t)$ on $D_\Delta$ is defined by writing for the
exact solution $u(x,t)$ of (1)

(5a)    $u_t(x,t) - u_{x\bar x}(x,t) \equiv \tau(x,t), \quad (x,t) \in D_\Delta.$

By Taylor's theorem the truncation error can be expressed, assuming $u$
to be sufficiently smooth, as

(5b)    $\tau(x,t) = \frac{\Delta t}{2}\frac{\partial^2 u}{\partial t^2} - \frac{\Delta x^2}{12}\frac{\partial^4 u}{\partial x^4},$

where the derivatives are evaluated at intermediate points of the star.
With the definition

    $e(x,t) \equiv U(x,t) - u(x,t)$

we get from (5) and (3)

    $e_{j,n+1} = (1 - 2\lambda)e_{j,n} + \lambda(e_{j+1,n} + e_{j-1,n}) - \Delta t\,\tau_{j,n}.$



Figure 1. Net points of star for the explicit difference scheme (3).
If $0 < \lambda \le \frac{1}{2}$ then $(1 - 2\lambda) \ge 0$ and with the definitions

    $E_n \equiv \sup_j |e_{j,n}|, \quad \tau \equiv \sup_{j,n} |\tau_{j,n}|,$

the above yields upon taking absolute values

    $|e_{j,n+1}| \le (1 - 2\lambda)|e_{j,n}| + \lambda(|e_{j+1,n}| + |e_{j-1,n}|) + \Delta t\,|\tau_{j,n}|$
    $\le E_n + \Delta t\,\tau.$

Or since the right-hand side is now independent of $j$,

    $E_{n+1} \le E_n + \Delta t\,\tau.$

Hence, by a recursive application

    $E_n \le E_0 + n\Delta t\,\tau = E_0 + t_n\tau.$

Thus we have deduced that

(6)    $|u(x,t) - U(x,t)| \le \sup_x |u(x,0) - U(x,0)| + t\,\mathcal{O}(\Delta t + \Delta x^2).$

Therefore, by recalling (4), $|u(x,t) - U(x,t)| \to 0$ as $\Delta t \to 0$ and
$\Delta x \to 0$, if $\lambda = \Delta t/\Delta x^2 \le \frac{1}{2}$. The convergence
demonstrated here is of order $\mathcal{O}(\Delta x^2)$ since $\Delta t = \lambda\Delta x^2$.
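The second-order convergence for fixed $\lambda \le \frac{1}{2}$ can be checked on a simple case. The sketch below is illustrative only: it applies (3b) to the initial data $\cos x$ on a periodic net (an assumption made here to avoid an infinite line; the exact solution is then $e^{-t}\cos x$), and the function name `explicit_heat_error` is hypothetical.

```python
import numpy as np

def explicit_heat_error(J, lam, t_end=0.1):
    """Max error of scheme (3b) for u(x,0) = cos(x) on a periodic
    net with J points over [0, 2*pi); lam is the mesh ratio dt/dx**2."""
    dx = 2.0 * np.pi / J
    dt = lam * dx ** 2
    steps = int(round(t_end / dt))
    x = dx * np.arange(J)
    U = np.cos(x)
    for _ in range(steps):
        # U_{j,n+1} = (1 - 2 lam) U_{j,n} + lam (U_{j+1,n} + U_{j-1,n})
        U = (1.0 - 2.0 * lam) * U + lam * (np.roll(U, -1) + np.roll(U, 1))
    exact = np.exp(-steps * dt) * np.cos(x)
    return float(np.max(np.abs(U - exact)))

err_coarse = explicit_heat_error(32, 0.4)
err_fine = explicit_heat_error(64, 0.4)
print(err_coarse, err_fine)
```

Halving $\Delta x$ at fixed $\lambda$ should reduce the error by roughly a factor of four, which is what the bound (6) with $\Delta t = \lambda\Delta x^2$ suggests.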

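To demonstrate the divergence of the difference scheme (3) when
$\lambda > \frac{1}{2}$ we first construct explicit solutions of the difference
equations. We try net functions of the exponential form

    $V^{(\alpha)}(x,t) = \mathrm{Re}\,(e^{i\alpha x - \omega t}).$

Then

    $V^{(\alpha)}_t = \mathrm{Re}\Bigl(\frac{e^{-\omega\Delta t} - 1}{\Delta t}\,e^{i\alpha x - \omega t}\Bigr), \quad V^{(\alpha)}_{x\bar x} = \mathrm{Re}\Bigl(-\frac{4}{\Delta x^2}\sin^2\frac{\alpha\Delta x}{2}\,e^{i\alpha x - \omega t}\Bigr).$

Now $V^{(\alpha)}$ is a solution of the difference equations provided that
$\omega$ and $\alpha$ satisfy

    $e^{-\omega\Delta t} = 1 - 4\lambda \sin^2\frac{\alpha\Delta x}{2}.$

The initial conditions satisfied by $V^{(\alpha)}$ are

(7a)    $V^{(\alpha)}(x,0) = \mathrm{Re}\,e^{i\alpha x} = \cos \alpha x$

and the solution can be written, since $e^{-\omega t} = (e^{-\omega\Delta t})^{t/\Delta t}$,

(7b)    $V^{(\alpha)}(x,t) = \cos \alpha x\,\Bigl(1 - 4\lambda \sin^2\frac{\alpha\Delta x}{2}\Bigr)^{t/\Delta t}.$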
Clearly, for all $\Delta x$ and $\Delta t$ such that $\lambda \le \frac{1}{2}$ and real
$\alpha$, it follows that the solution (7) satisfies $|V^{(\alpha)}(x,t)| \le 1$.
However, if $\lambda > \frac{1}{2}$ then for some $\alpha$ and $\Delta x$ we have
$|1 - 4\lambda \sin^2(\alpha\Delta x/2)| > 1$ and so $|V^{(\alpha)}(0,t)|$ becomes
arbitrarily large for sufficiently large $t/\Delta t$. We will capitalize on this
instability (see next section) of the difference scheme (3), if
$\lambda > \frac{1}{2}$, to construct a smooth initial condition for which divergence
is easily demonstrated. Since the difference equations are linear and
homogeneous, we may superpose solutions of the form (7) to get other
solutions. With $\alpha = \alpha_\nu = 2^\nu\pi$ and coefficients $\beta_\nu > 0$, we form

(8a)    $V(x,t) = \sum_{\nu=0}^{\infty} \beta_\nu V^{(\alpha_\nu)}(x,t)$
        $= \sum_{\nu=0}^{\infty} \beta_\nu \cos(2^\nu\pi x)\Bigl(1 - 4\lambda \sin^2\frac{2^\nu\pi\Delta x}{2}\Bigr)^{t/\Delta t}.$

The corresponding initial function

(8b)    $V(x,0) = f(x) = \sum_{\nu=0}^{\infty} \beta_\nu \cos(2^\nu\pi x),$

has as many derivatives as we wish provided that $\beta_\nu \to 0$ sufficiently
fast. Now let $\Delta x = 2^{-m}$ and $\Delta t = \lambda 4^{-m}$ so that (8a) yields

    $V(0,t) = \sum_{\nu=0}^{\infty} \beta_\nu\Bigl[1 - 4\lambda \sin^2\Bigl(2^{\nu-m}\frac{\pi}{2}\Bigr)\Bigr]^{t/\Delta t}$
    $= \sum_{\nu=0}^{m} \beta_\nu\Bigl[1 - 4\lambda \sin^2\Bigl(2^{\nu-m}\frac{\pi}{2}\Bigr)\Bigr]^{t/\Delta t} + \sum_{\nu=m+1}^{\infty} \beta_\nu.$

But

    $\sin^2\Bigl(2^{\nu-m}\frac{\pi}{2}\Bigr) \le \frac{1}{2}$ for $\nu = 0, 1, \ldots, m-1$,

and so the above yields, for $\frac{1}{2} < \lambda < 1$, with $\beta_\nu > 0$,

    $|V(0,t)| \ge -\sum_{\nu=0}^{\infty} \beta_\nu + \beta_m(4\lambda - 1)^{t/\Delta t}$
    $= -f(0) + \beta_m(4\lambda - 1)^{t4^m/\lambda}.$

Now if the $\beta_\nu$ are chosen as $\beta_\nu = e^{-2^\nu}$, then the initial function
$f(x)$ is a smooth (analytic) function and the estimate yields, for
$\frac{1}{2} < \lambda < 1$,

(8c)    $|V(0,t)| \ge -V(0,0) + e^{-2^m}(4\lambda - 1)^{t4^m/\lambda}.$

Thus, as $m \to \infty$, it follows that $|V(0,t)|$ becomes unbounded, for any
finite $t > 0$, since $4\lambda - 1 > 1$. Hence this difference solution cannot
converge to the solution of the corresponding smooth problem with
initial data given by (8b). Thus, as was to be shown, the scheme (3) and (4)
does not generally converge when $\lambda > \frac{1}{2}$. We say that the difference
scheme (3) is conditionally convergent, which means that the scheme is
convergent only if $\lambda$ satisfies some condition, i.e., $\lambda \le \frac{1}{2}$. In the
next subsection, we will see that it is possible to construct unconditionally
convergent schemes for the mixed initial-boundary value problem.
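The role of the bound $\lambda \le \frac{1}{2}$ can be illustrated by tabulating the factor $1 - 4\lambda\sin^2(\alpha\Delta x/2)$ from (7b). The following is a small sketch with hypothetical helper names (`max_amplification`, `growth_after`); it only samples the factor and is not a proof.

```python
import numpy as np

def max_amplification(lam, samples=2001):
    """Largest |1 - 4 lam sin^2(theta/2)| over theta = alpha*dx in [0, pi]."""
    theta = np.linspace(0.0, np.pi, samples)
    g = 1.0 - 4.0 * lam * np.sin(theta / 2.0) ** 2
    return float(np.max(np.abs(g)))

def growth_after(lam, steps):
    """Magnitude of the mode alpha*dx = pi at x = 0 after `steps` time steps."""
    return abs(1.0 - 4.0 * lam) ** steps

for lam in (0.4, 0.5, 0.6):
    print(lam, max_amplification(lam), growth_after(lam, 100))
```

For $\lambda = 0.6$ the worst mode is multiplied by $1.4$ in magnitude at every step, so after a fixed time $t$ the growth is $1.4^{t/\Delta t}$, which is unbounded as the net is refined.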

4.1. Implicit Methods 

To demonstrate implicit difference schemes, we consider mixed
initial-boundary value problems for the inhomogeneous heat equation. That is,

(9a)    $\frac{\partial u}{\partial t} - \frac{\partial^2 u}{\partial x^2} = s(x,t), \quad 0 < x < L, \quad t > 0;$

(9b)    $u(x,0) = f(x), \quad 0 \le x \le L;$

(9c)    $u(0,t) = g(t), \quad u(L,t) = h(t), \quad t > 0.$

The net spacing is now chosen such that

    $\Delta x = \frac{L}{J+1},$

and the net points in the interior of the half strip

    $D \equiv \{x, t \mid 0 < x < L,\; t > 0\}$

we denote by $D_\Delta$; i.e.,

    $D_\Delta \equiv \{x, t \mid x = j\Delta x,\; 1 \le j \le J;\; t = n\Delta t,\; n = 1, 2, \ldots\}.$

For a net function $U(x,t)$, we define the implicit difference equations

(10a)    $U_{\bar t}(x,t) - U_{x\bar x}(x,t) = s(x,t), \quad (x,t) \in D_\Delta.$

In subscript notation, again with $(x,t) = (x_j, t_n)$, these equations can be
written as

(10b)    $(1 + 2\lambda)U_{j,n} = U_{j,n-1} + \lambda(U_{j+1,n} + U_{j-1,n}) + \Delta t\,s_{j,n};$
         $n = 1, 2, \ldots, \quad 1 \le j \le J.$

The only difference between (3a) and (10a) is the time difference quotient,
which is forward in (3) and backward in (10). The star associated with
(10) is shown in Figure 2. The initial and boundary data are specified in
the obvious way

(11a)    $U(x_j, 0) = U_{j,0} = f(x_j), \quad 0 \le x_j \le L;$

(11b)    $U(0, t_n) = U_{0,n} = g(t_n), \quad U(L, t_n) = U_{J+1,n} = h(t_n), \quad t_n > 0.$
Figure 2. Net points of star for implicit difference scheme (10).


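For each $t = t_n$ the equations in (10) and (11b) form a system of $J + 2$
linear equations in the unknowns $U_{j,n}$, $0 \le j \le J + 1$. However, since
$U_{0,n}$ and $U_{J+1,n}$ are specified in (11b), it can be reduced to a system of
order $J$. In fact, with the coefficient matrix $A$ of order $J$ defined by

(12)    $A \equiv I + \lambda B, \quad B \equiv \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix},$

and the $J$-dimensional vectors $\mathbf{U}_n$, $\mathbf{b}_n$, $\mathbf{s}_n$, and $\mathbf{f}$ defined by

(13)    $\mathbf{U}_n \equiv (U_{1,n}, \ldots, U_{J,n})^T, \quad \mathbf{b}_n \equiv (g(t_n), 0, \ldots, 0, h(t_n))^T,$
        $\mathbf{s}_n \equiv (s_{1,n}, \ldots, s_{J,n})^T, \quad \mathbf{f} \equiv (f(x_1), \ldots, f(x_J))^T,$

the systems (10) and (11) can be written as

(14)    $A\mathbf{U}_n = \mathbf{U}_{n-1} + \lambda\mathbf{b}_n + \Delta t\,\mathbf{s}_n, \quad n = 1, 2, \ldots; \quad \mathbf{U}_0 = \mathbf{f}.$

For each $n$, a system with the same tridiagonal coefficient matrix $A$ must
be solved. Since $\lambda > 0$, it follows that the lemma of Subsection 3.2 in
Chapter 2 applies. Thus, not only is $A$ non-singular, but the solution of each
system is easily obtained by evaluating two simple two term recursions.
Of course, the factorization $A = LU$ need only be done initially. [See
equations (3.10) through (3.13) of Chapter 2.]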
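The solution procedure just described can be sketched as follows. This is an illustrative implementation under stated assumptions, not the text's own code: the function names are hypothetical, and the tridiagonal factorization is written out directly for the matrix $A = I + \lambda B$ with diagonal $1 + 2\lambda$ and off-diagonals $-\lambda$, rather than copied from equations (3.10) through (3.13) of Chapter 2.

```python
import numpy as np

def factor_tridiagonal(lam, J):
    """Factor A = I + lam*B once: unit lower bidiagonal L (multipliers
    `lower`) times upper bidiagonal U (diagonal `diag`, superdiagonal -lam)."""
    diag = np.empty(J)
    lower = np.zeros(J)              # lower[0] is unused
    diag[0] = 1.0 + 2.0 * lam
    for j in range(1, J):
        lower[j] = -lam / diag[j - 1]
        diag[j] = 1.0 + 2.0 * lam + lam * lower[j]
    return lower, diag

def solve_tridiagonal(lower, diag, lam, rhs):
    """Solve A y = rhs by two two-term recursions (forward, then backward)."""
    J = len(rhs)
    z = np.empty(J)
    z[0] = rhs[0]
    for j in range(1, J):
        z[j] = rhs[j] - lower[j] * z[j - 1]
    y = np.empty(J)
    y[-1] = z[-1] / diag[-1]
    for j in range(J - 2, -1, -1):
        y[j] = (z[j] + lam * y[j + 1]) / diag[j]
    return y

def implicit_heat(f, g, h, s, L, J, dt, steps):
    """March system (14): A U_n = U_{n-1} + lam*b_n + dt*s_n."""
    dx = L / (J + 1)
    lam = dt / dx ** 2
    x = dx * np.arange(1, J + 1)
    lower, diag = factor_tridiagonal(lam, J)   # done only once
    U = f(x)
    for n in range(1, steps + 1):
        t = n * dt
        rhs = U + dt * s(x, t)
        rhs[0] += lam * g(t)       # boundary value U_{0,n}
        rhs[-1] += lam * h(t)      # boundary value U_{J+1,n}
        U = solve_tridiagonal(lower, diag, lam, rhs)
    return x, U

# Example with lam = 25, far above the explicit limit of 1/2.
x, U = implicit_heat(lambda x: np.sin(np.pi * x), lambda t: 0.0,
                     lambda t: 0.0, lambda x, t: np.zeros_like(x),
                     L=1.0, J=49, dt=0.01, steps=10)
print(np.max(np.abs(U - np.exp(-np.pi ** 2 * 0.1) * np.sin(np.pi * x))))
```

The example uses homogeneous boundary data and no source term so that the exact solution $e^{-\pi^2 t}\sin \pi x$ is available for comparison.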

It is clear from (14) that the difference solution $\mathbf{U}_n$ at any time $t_n$
depends upon all components of the initial data, $\mathbf{U}_0$, for any value of
$\lambda$. This is also clear from the form of the star corresponding to the
difference equations (10). Thus for any value of the mesh slope,
$\Delta t/\Delta x$, the numerical domain of dependence is the entire initial line
segment and hence the domain of dependence condition is automatically
satisfied by the implicit difference scheme. We shall show that, in addition,
the implicit difference solution converges for all values of $\lambda$ to the exact
solution. In other words, the scheme is unconditionally convergent.

The truncation error $\tau(x,t)$ of the solution $u(x,t)$ of (9) is defined for
the difference scheme (10) as

(15a)    $\tau(x,t) \equiv u_{\bar t}(x,t) - u_{x\bar x}(x,t) - s(x,t), \quad (x,t) \in D_\Delta.$

Since $u(x,t)$ satisfies (9a), we obtain by the usual Taylor's series
expansions, assuming sufficient differentiability of the solution,

(15b)    $\tau(x,t) = \mathcal{O}(\Delta t + \Delta x^2).$

Now from (10), (11), and (15) we get for the error,

    $e(x,t) \equiv U(x,t) - u(x,t),$

the difference problem

(16a)    $e_{\bar t}(x,t) - e_{x\bar x}(x,t) = -\tau(x,t), \quad (x,t) \in D_\Delta;$

(16b)    $e(x,0) = 0;$

(16c)    $e(0,t) = 0, \quad e(L,t) = 0.$

In subscript notation (16a) yields

    $(1 + 2\lambda)e_{j,n} = e_{j,n-1} + \lambda(e_{j+1,n} + e_{j-1,n}) - \Delta t\,\tau_{j,n},$
    $n = 1, 2, \ldots, \quad 1 \le j \le J.$

By taking absolute values and using $E_n \equiv \max_j |e_{j,n}|$,
$\tau \equiv \sup_{j,\,n' \le N} |\tau_{j,n'}|$, and $\lambda > 0$, we get

    $(1 + 2\lambda)|e_{j,n}| \le E_{n-1} + 2\lambda E_n + \Delta t\,\tau, \quad 1 \le j \le J, \quad n \le N.$

Since the right-hand side is independent of $j$ and $e_{0,n} = e_{J+1,n} = 0$, we
may replace $|e_{j,n}|$ by $E_n$ to get

    $E_n \le E_{n-1} + \Delta t\,\tau.$

Thus by the usual recursion technique

    $E_n \le E_0 + t_n\tau,$

or

(17)    $|u(x,t) - U(x,t)| \le \max_x |u(x,0) - U(x,0)| + t\tau$
        $\le t\,\mathcal{O}(\Delta t + \Delta x^2), \quad t \le N\Delta t \equiv T.$

From this result we deduce unconditional convergence as $\Delta t \to 0$ and
$\Delta x \to 0$, i.e., $\lambda$ is arbitrary.

There are other implicit schemes which converge for arbitrary $\lambda$ and
one of them in particular has a local truncation error which is
$\mathcal{O}(\Delta t^2 + \Delta x^2)$. We examine the family of schemes defined by

(18a)    $U_{\bar t}(x,t) - [\theta U_{x\bar x}(x,t) + (1 - \theta)U_{x\bar x}(x, t - \Delta t)]$
         $= \theta s(x,t) + (1 - \theta)s(x, t - \Delta t), \quad (x,t) \in D_\Delta.$

Here $\theta$ is a real parameter such that $0 \le \theta \le 1$. For $\theta = 1$, (18a)
reduces to (10a); while for $\theta = 0$, (18a) is equivalent to (3). For any
$\theta \ne 0$, the difference equations (18) are implicit. The boundary and
initial data are as specified in (11). In subscript notation (18a) takes the form

(18b)    $(1 + 2\theta\lambda)U_{j,n} - \theta\lambda(U_{j+1,n} + U_{j-1,n})$
         $= [1 - 2(1 - \theta)\lambda]U_{j,n-1} + (1 - \theta)\lambda(U_{j+1,n-1} + U_{j-1,n-1})$
         $+ \Delta t[\theta s_{j,n} + (1 - \theta)s_{j,n-1}],$
         $n = 1, 2, \ldots; \quad 1 \le j \le J.$

By using the matrices and vectors in (12) and (13), the system (18) and (11)
can be written as

    $\mathbf{U}_0 = \mathbf{f},$

(19)    $(I + \theta\lambda B)\mathbf{U}_n = [I - (1 - \theta)\lambda B]\mathbf{U}_{n-1} + \lambda[\theta\mathbf{b}_n + (1 - \theta)\mathbf{b}_{n-1}]$
        $+ \Delta t[\theta\mathbf{s}_n + (1 - \theta)\mathbf{s}_{n-1}], \quad n = 1, 2, \ldots.$

These systems can be solved by factoring the tridiagonal matrix
$I + \theta\lambda B$. Clearly, for $\theta \ne 0$, the domain of dependence condition is
satisfied for arbitrary $\Delta t/\Delta x$.
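A compact way to experiment with the family (19) is sketched below. It is an illustration under simplifying assumptions that are not in the text: homogeneous boundary data ($\mathbf{b}_n = 0$), no source term ($\mathbf{s}_n = 0$), initial data $\sin(\pi x/L)$, and a dense `numpy.linalg.solve` in place of the tridiagonal factorization. The name `theta_scheme_error` is hypothetical.

```python
import numpy as np

def theta_scheme_error(theta, J, dt, t_end=0.1, L=1.0):
    """Max error of (19) with b_n = 0, s_n = 0 and f(x) = sin(pi x / L)."""
    dx = L / (J + 1)
    lam = dt / dx ** 2
    x = dx * np.arange(1, J + 1)
    B = 2.0 * np.eye(J) - np.eye(J, k=1) - np.eye(J, k=-1)
    left = np.eye(J) + theta * lam * B            # I + theta*lam*B
    right = np.eye(J) - (1.0 - theta) * lam * B   # I - (1-theta)*lam*B
    steps = int(round(t_end / dt))
    U = np.sin(np.pi * x / L)
    for _ in range(steps):
        U = np.linalg.solve(left, right @ U)
    exact = np.exp(-(np.pi / L) ** 2 * steps * dt) * np.sin(np.pi * x / L)
    return float(np.max(np.abs(U - exact)))

err_implicit = theta_scheme_error(1.0, 49, 0.01)   # scheme (10)
err_centered = theta_scheme_error(0.5, 49, 0.01)   # theta = 1/2
print(err_implicit, err_centered)
```

Both runs use $\lambda = 25$; the choice $\theta = \frac{1}{2}$ gives a markedly smaller error for the same net.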

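The truncation error $\tau(x,t)$ now depends upon the parameter $\theta$. The
usual Taylor's series expansion about $(x,t)$ yields, if the solution $u(x,t)$
of (9) has enough derivatives and we make use of (9a) to simplify,

(20)    $\tau(x,t) \equiv u_{\bar t}(x,t) - [\theta u_{x\bar x}(x,t) + (1 - \theta)u_{x\bar x}(x, t - \Delta t)]$
        $- [\theta s(x,t) + (1 - \theta)s(x, t - \Delta t)]$
        $= \Bigl[\frac{\partial u}{\partial t} - \frac{\partial^2 u}{\partial x^2} - s(x,t)\Bigr] - \frac{\Delta t}{2}\Bigl[\frac{\partial^2 u}{\partial t^2} - 2(1 - \theta)\Bigl(\frac{\partial^3 u}{\partial t\,\partial x^2} + \frac{\partial s}{\partial t}\Bigr)\Bigr]$
        $+ \mathcal{O}[(\Delta t)^2 + (\Delta x)^2]$
        $= \frac{(1 - 2\theta)\Delta t}{2}\frac{\partial^2 u}{\partial t^2} + \mathcal{O}(\Delta t^2 + \Delta x^2).$

Thus for the special case $\theta = \frac{1}{2}$ the truncation error is
$\mathcal{O}(\Delta t^2 + \Delta x^2)$. [In this case all the difference quotients in (18)
are centered about $(x, t - \Delta t/2)$ and the difference method is called the
Crank-Nicolson scheme.] For arbitrary $\theta$, the truncation error is
$\mathcal{O}(\Delta t + \Delta x^2)$, as in the explicit and purely implicit cases. With the
notation $e \equiv U - u$ we obtain from (20), (18), (11), and (9b and c)

(21a)    $e_{\bar t}(x,t) - [\theta e_{x\bar x}(x,t) + (1 - \theta)e_{x\bar x}(x, t - \Delta t)]$
         $= -\tau(x,t), \quad (x,t) \in D_\Delta;$

(21b)    $e(x,0) = 0;$

(21c)    $e(0,t) = 0, \quad e(L,t) = 0.$

Let us write (21) in vector form by using the matrix $B$ of (12) and the
vectors

    $\mathbf{e}_n \equiv (e_{1,n}, \ldots, e_{J,n})^T, \quad \boldsymbol{\tau}_n \equiv (\tau_{1,n}, \ldots, \tau_{J,n})^T,$

to get $\mathbf{e}_0 = 0$ and

(22)    $(I + \theta\lambda B)\mathbf{e}_n = [I - (1 - \theta)\lambda B]\mathbf{e}_{n-1} - \Delta t\,\boldsymbol{\tau}_n, \quad n = 1, 2, \ldots.$

Since $\lambda > 0$ and $I + \theta\lambda B$ is non-singular, we may multiply by the
inverse of this matrix to get

(23a)    $\mathbf{e}_n = C\mathbf{e}_{n-1} + \Delta t\,\boldsymbol{\sigma}_n, \quad n = 1, 2, \ldots,$

where we have introduced

(23b)    $C \equiv (I + \theta\lambda B)^{-1}[I - (1 - \theta)\lambda B] = I - \lambda(I + \theta\lambda B)^{-1}B,$
         $\boldsymbol{\sigma}_n \equiv -(I + \theta\lambda B)^{-1}\boldsymbol{\tau}_n.$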
(24)    $\mathbf{e}_n = C^n\mathbf{e}_0 + \Delta t \sum_{\nu=1}^{n} C^{n-\nu}\boldsymbol{\sigma}_\nu.$

Upon taking norms of this representation of the error, we get

(25)    $\|\mathbf{e}_n\| \le \|C\|^n \cdot \|\mathbf{e}_0\| + \Delta t \sum_{\nu=1}^{n} \|C\|^{n-\nu} \cdot \|\boldsymbol{\sigma}_\nu\|$
        $\le \|C\|^n \cdot \|\mathbf{e}_0\| + \Delta t\,\frac{1 - \|C\|^n}{1 - \|C\|}\,\max_\nu \|\boldsymbol{\sigma}_\nu\|.$

[This could have been deduced directly from (23a) by first taking the norm
and then applying the recursion.]

Let us use a special norm in (25) for which we are able to compute
$\|C\|$, i.e., $\|\mathbf{x}\| = (\sum_i |x_i|^2)^{1/2}$. Then, since $B$ is symmetric, it
follows that $C$ in (23b) is symmetric and by (1.11) of Chapter 1,

    $\|C\| = \rho(C) \equiv \max_j |\gamma_j(C)|,$

where $\gamma_j(C)$ is an eigenvalue of $C$. That is, the spectral radius of $C$ is
the corresponding natural norm. The eigenvalues of $B$ are easily obtained.
We note that the matrix $B$ is related to the matrix $H$ in (1.17b). Using $L_J$
of (1.14b) we have $B \equiv 2I - (L_J + L_J^T)$ and the calculations in
(1.23)-(1.25) are applicable. Specifically the eigenvalues $\beta_j$ of $B$ are
found to be

(26a)    $\beta_j = 4 \sin^2\frac{j\pi}{2(J+1)}, \quad j = 1, 2, \ldots, J;$

corresponding to the eigenvectors

(26b)    $\Bigl(\sin\frac{j\pi}{J+1},\; \sin\frac{2j\pi}{J+1},\; \ldots,\; \sin\frac{Jj\pi}{J+1}\Bigr)^T, \quad j = 1, 2, \ldots, J.$

Thus the eigenvalues, $\gamma_j$, of $C$ defined in (23b) are

(27)    $\gamma_j = 1 - \frac{\lambda\beta_j}{1 + \theta\lambda\beta_j}, \quad j = 1, 2, \ldots, J.$

In order that

    $\rho(C) = \max_j |\gamma_j| < 1$

we must have $-1 < \gamma_j < 1$. Since $\beta_j > 0$, it follows that $\gamma_j < 1$ for
all $\lambda > 0$ and $\theta \ge 0$. Now $\gamma_j > -1$ is equivalent to

    $(1 - 2\theta)\lambda\beta_j < 2,$

and this is satisfied for all $\lambda > 0$ if $\theta \ge \frac{1}{2}$. Thus we have shown
that

(28a)    $\|C\| < 1$ for all $\lambda > 0$ if $\theta \ge \frac{1}{2}$.

On the other hand, for $0 \le \theta < \frac{1}{2}$, we must have
$\lambda < 2/[(1 - 2\theta)\beta_j]$. Or since $0 < \beta_j < 4$ for $j = 1, 2, \ldots, J$,
this implies

(28b)    $\|C\| < 1$ for $\lambda \le \frac{1}{2(1 - 2\theta)}$ if $0 \le \theta < \frac{1}{2}$.
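Conditions (28a) and (28b) can be checked numerically from the closed forms (26a) and (27). The sketch below is illustrative; `beta_eigenvalues` and `spectral_radius_C` are hypothetical names, and the comparison against `numpy.linalg.eigvalsh` is only a sanity check of (26a) for one small $J$.

```python
import numpy as np

def beta_eigenvalues(J):
    """Eigenvalues (26a) of B = tridiag(-1, 2, -1) of order J."""
    j = np.arange(1, J + 1)
    return 4.0 * np.sin(j * np.pi / (2.0 * (J + 1))) ** 2

def spectral_radius_C(theta, lam, J):
    """rho(C) = max |gamma_j| with gamma_j given by (27)."""
    beta = beta_eigenvalues(J)
    gamma = 1.0 - lam * beta / (1.0 + theta * lam * beta)
    return float(np.max(np.abs(gamma)))

J = 20
B = 2.0 * np.eye(J) - np.eye(J, k=1) - np.eye(J, k=-1)
print(np.allclose(np.linalg.eigvalsh(B), beta_eigenvalues(J)))

# theta >= 1/2: rho(C) < 1 for any lam, as in (28a).
print(spectral_radius_C(0.5, 1.0e3, J), spectral_radius_C(1.0, 1.0e6, J))
# theta = 1/4: the bound in (28b) is lam <= 1.
print(spectral_radius_C(0.25, 1.0, J), spectral_radius_C(0.25, 1.5, J))
```

With $\theta = \frac{1}{4}$ the value $\lambda = 1.5$ exceeds the bound in (28b) and the spectral radius rises above one for this $J$.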

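Under either of the conditions (28), we obtain from (25) and (23b), since

    $\|\boldsymbol{\sigma}_\nu\| \le \|(I + \theta\lambda B)^{-1}\| \cdot \|\boldsymbol{\tau}_\nu\| \le \frac{1}{1 + \theta\lambda\beta_1}\,\|\boldsymbol{\tau}_\nu\| \le \|\boldsymbol{\tau}_\nu\|,$

that

(29)    $\|\mathbf{e}_n\| \le \|\mathbf{e}_0\| + \frac{\Delta t}{1 - \|C\|}\,\max_\nu \|\boldsymbol{\tau}_\nu\|$
        $= \|\mathbf{e}_0\| + \frac{\Delta t}{1 - \|C\|}\,\mathcal{O}[(\theta - \tfrac{1}{2})\Delta t + \Delta t^2 + \Delta x^2].$

To examine the convergence properties as $\Delta x \to 0$ and $\Delta t \to 0$, we note
that for small $\Delta x$

    $\beta_1 = \Bigl(\frac{\pi}{L}\Delta x\Bigr)^2 + \mathcal{O}(\Delta x^4), \quad \beta_J = 4 - \Bigl(\frac{\pi}{L}\Delta x\Bigr)^2 + \mathcal{O}(\Delta x^4).$

It is easily established that $1 - [x/(1 + \theta x)]$ is a decreasing function of
$x$ and an increasing function of $\theta$ for $x > 0$ and $\theta \ge 0$. Thus

(30)    $\|C\| = \max(\gamma_1, |\gamma_J|),$

and for small $\Delta x$

(31)    $\gamma_1 = 1 - \lambda\Bigl(\frac{\pi}{L}\Delta x\Bigr)^2 + \mathcal{O}(\lambda\Delta x^4)$
        $= 1 - \Delta t\,\frac{\pi^2}{L^2} + \mathcal{O}(\lambda\Delta x^4).$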


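In case (28b), for any $\theta$ in $0 \le \theta < \frac{1}{2}$ and
$\lambda \le 1/[2(1 - 2\theta)]$, we get an upper bound for $|\gamma_J|$ by picking the
largest value of $\lambda$,

(32)    $|\gamma_J| \le \Bigl|1 - \frac{\beta_J}{2(1 - 2\theta) + \theta\beta_J}\Bigr|$
        $\le 1 - (\tfrac{1}{2} - \theta)\Bigl(\frac{\pi}{L}\Delta x\Bigr)^2 + \mathcal{O}(\Delta x^4).$

Hence, in case (28b) we find from (30), (31), and (32) that

    $1 - \|C\| \ge \begin{cases} \mathcal{O}(\Delta t), & \text{or} \\ \mathcal{O}(\Delta x^2). \end{cases}$

Therefore (29) yields for $0 \le \theta < \frac{1}{2}$, $0 < \lambda \le 1/[2(1 - 2\theta)]$,

(33)    $\|\mathbf{e}_n\| \le \|\mathbf{e}_0\| + \max(1, \lambda)\,\mathcal{O}[(\theta - \tfrac{1}{2})\Delta t + \Delta t^2 + \Delta x^2].$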

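Finally, in case (28a), $\theta \ge \frac{1}{2}$ and $\lambda$ arbitrary, we get an upper
bound for $|\gamma_J|$ by picking the smallest value of $\theta$,

(34)    $|\gamma_J| \le \Bigl|1 - \frac{\lambda\beta_J}{1 + \frac{1}{2}\lambda\beta_J}\Bigr|$
        $\le \Bigl|1 - \frac{4\lambda}{1 + 2\lambda} + \frac{\lambda}{(1 + 2\lambda)^2}\Bigl(\frac{\pi}{L}\Delta x\Bigr)^2 + \mathcal{O}(\lambda\Delta x^4)\Bigr|.$

The last inequality is most useful in the case of very large $\lambda$, i.e.,

    $|\gamma_J| \le \frac{4\lambda}{1 + 2\lambda} - 1 - \frac{\lambda}{(1 + 2\lambda)^2}\Bigl(\frac{\pi}{L}\Delta x\Bigr)^2 + \mathcal{O}(\lambda\Delta x^4) \approx 1 - \frac{1}{\lambda}, \quad \lambda \gg 1.$

Therefore,

    $1 - \|C\| \ge \begin{cases} \mathcal{O}(\Delta t), \\ \mathcal{O}(1/\lambda), \end{cases} \quad \theta \ge \tfrac{1}{2}.$

Hence (29) becomes for $\theta \ge \frac{1}{2}$, $\lambda$ arbitrary,

(35)    $\|\mathbf{e}_n\| \le \|\mathbf{e}_0\| + \max(1, \lambda\Delta t)\,\mathcal{O}[(\theta - \tfrac{1}{2})\Delta t + \Delta t^2 + \Delta x^2].$

Inequality (35) indicates that with the choice $\theta = \frac{1}{2}$,
$\lambda\Delta t = $ constant, the error is bounded by

(36)    $\|\mathbf{e}_n\| \le \|\mathbf{e}_0\| + \mathcal{O}(\Delta x^2).$

Of course, this error bound is of the same magnitude as the error estimate
for the explicit scheme, but for a much larger time step. That is, even
though the number of operations required to solve the implicit equations
for one time step is of the order of twice the number of operations required
for one time step of the explicit scheme, there is a tremendous saving in
labor when we choose $\lambda\Delta t = $ constant for the Crank-Nicolson scheme.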

PROBLEMS, SECTION 4

1. Given $u(x,t)$ continuous in

    $\bar D \equiv D \cup C_1 \cup C_2,$

where

    $D \equiv \{x, t \mid 0 < x < L,\; 0 < t < T\},$

    $C_1 \equiv \{x, t \mid 0 < x < L,\; t = T\},$

    $C_2 \equiv \{x, t \mid 0 \le x \le L,\; t = 0;$
    $\qquad\quad 0 = x,\; 0 \le t \le T;$
    $\qquad\quad x = L,\; 0 \le t \le T\}.$

If, in $D \cup C_1$, $u(x,t)$ has a continuous second derivative with respect to $x$
and a continuous first derivative with respect to $t$, such that

    $\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2},$

then

    $\max_{\bar D} u(x,t)$

is attained at some point of $C_2$. (This is a weak form of the maximum principle
satisfied by the solutions of the heat equation.)

[Hint: For any $\epsilon > 0$, set

    $v(x,t) \equiv u(x,t) - \epsilon t.$

Clearly, $v(x,t)$ cannot attain its maximum at any point $P$ of $D \cup C_1$, since
otherwise

    $\frac{\partial v}{\partial t} \ge 0$ at $P$;

hence

    $\frac{\partial^2 v}{\partial x^2} = \frac{\partial v}{\partial t} + \epsilon$

would be positive at $P$, which is impossible at a maximum. Therefore,
$v(x,t)$ attains its maximum only on $C_2$. But since $\epsilon > 0$ could be
arbitrarily small, this implies that $u(x,t)$ must attain its maximum on $C_2$.]

(Note: By considering

    $w(x,t) \equiv -u(x,t),$

we can establish the minimum principle,

    $\min_{\bar D} u(x,t)$

is attained at some point in $C_2$.)

2. Verify that the solution $U_{j,n}$ of (10a and b) and (11a and b) satisfies
the maximum principle

    $V_n \le \max\{V_0 + n\Delta t\,S_n,\; G_n,\; H_n\}$

where

    $V_n \equiv \max_j |U_{j,n}|, \quad n = 0, 1, 2, \ldots,$

    $S_n \equiv \max_{0 \le t \le n\Delta t,\; 0 \le x \le L} |s(x,t)|,$

    $G_n \equiv \max_{0 \le t \le n\Delta t} |g(t)|,$

    $H_n \equiv \max_{0 \le t \le n\Delta t} |h(t)|.$

[Hint: Modify the argument that estimates $E_n$, after equations (16a, b,
and c).]

3. Formulate and prove a maximum principle for the explicit scheme

(10a')    $U_t(x,t) - U_{x\bar x}(x,t) = s(x,t), \quad (x,t) \in D_\Delta$, with $\lambda \le \frac{1}{2}$.

Use the auxiliary conditions (11a and b).

4. Prove convergence of the finite difference solutions $U(x,t)$ to the
solution of the mixed initial-boundary value problem in (9a, b, and c) with
$s(x,t) \equiv 0$, under the weak compatibility condition

    $f(0) = g(0); \quad f(L) = h(0),$

and with $f(x)$, $g(t)$ and $h(t)$ that are continuous.

[Hint: Uniformly approximate $f(x)$, $g(t)$, and $h(t)$ by polynomials $f_m(x)$,
$g_m(t)$, and $h_m(t)$ satisfying the strong compatibility condition,

    $f_m''(0) = g_m'(0), \quad f_m^{(iv)}(0) = g_m''(0);$

    $f_m''(L) = h_m'(0), \quad f_m^{(iv)}(L) = h_m''(0).$

That is, assume there exist corresponding smooth solutions $u_m(x,t)$ and, for
any given $(\Delta x, \Delta t)$, difference solutions $U_m(x,t)$ as well as the
continuous solution $u(x,t)$ and the difference solution $U(x,t)$. Then estimate

    $u(x,t) - U(x,t) = [u(x,t) - u_m(x,t)]$
    $\qquad + [u_m(x,t) - U_m(x,t)]$
    $\qquad + [U_m(x,t) - U(x,t)],$

by using the maximum principles for the outermost bracketed terms to fix
an $m$ for which they contribute at most $\epsilon$ for all $(\Delta x, \Delta t)$. Next, pick
$(\Delta x, \Delta t)$ sufficiently small so that the middle bracketed term is at
most $\epsilon$.]


5. GENERAL THEORY: CONSISTENCY, CONVERGENCE, AND 
STABILITY 

The apparently scattered results of the preceding sections can be related
by a simple general theory. A more complete and rigorous development
could be given with the aid of the simplest notions of functional analysis,
which we forego.

A partial differential equation can be represented symbolically as

(1a)    $L(u) = f(P), \quad P \in D;$

with the convention that only terms involving the dependent variable $u$
are included on the left-hand side of (1) and that all inhomogeneous terms
are included in $f$ [i.e., $f$ is a function only of the independent variables,
$P \equiv (t, x, y, \ldots)$]. The domain in which (1a) is to be satisfied is denoted
by $D$. The set of points on which boundary and/or initial data are prescribed
is denoted by $C$. The conditions to be satisfied by $u$ on $C$ can be represented
as

(1b)    $B(u) = g(P), \quad P \in C.$

Here $B$ may not be a differential operator, but (1b) merely represents the
conditions imposed on various parts of $C$. For example, conditions (4.9b)
and (4.9c) would both be incorporated in (1b) for the mixed
initial-boundary value problem of (4.9). We shall only consider problems (1)
for which a unique and smooth solution $u$ exists for any data in some class
of smooth functions $\{f, g\}$ (smooth means "sufficiently" differentiable).

Let us consider a net for the independent variables of the problem (1)
with spacing: $\Delta t$, $\Delta x$, $\Delta y$, .... Certain of these net points, say those
interior to $D$, will be denoted by the set $D_\Delta$. Similarly, boundary net
points, $C_\Delta$, will also be defined. There are various ways in which this can
be done, depending upon the difference method employed. Obviously net points
lying on $C$ may be included in $C_\Delta$, but frequently we may also wish to
include the points of intersection of $C$ with the net lines. (In fact, for some
problems, net points outside of $C$ are included in $C_\Delta$ and points outside
$D$ are included in $D_\Delta$, but we shall not dwell on these possibilities in the
present discussion.)

At the points of $D_\Delta + C_\Delta$ a difference approximation $U$ is defined as
the solution of some set of difference equations. These may be indicated
symbolically as

(2a)    $L_\Delta(U) = f(P), \quad P \in D_\Delta,$

which is to approximate (1a); and the boundary difference approximations
are indicated by

(2b)    $B_\Delta(U) = g(P), \quad P \in C_\Delta.$

Again the notation may imply different relations over different parts of
$C_\Delta$, say as in (4.11). Of course, it is desired that the difference solution
$U$ of (2) should be a close approximation to the solution $u$ of (1) at
corresponding points of $D_\Delta + C_\Delta$ for all data that are sufficiently smooth.
Furthermore, the difference solution should be uniquely defined by (2)
and its numerical evaluation should be possible without significant loss
of accuracy due, say, to roundoff errors. To study these questions the
three notions of consistency, convergence, and stability of the difference
schemes are introduced. It is then easily shown that for consistent schemes,
stability implies convergence. We begin with the definition of

consistency. Let $\phi(t, x, y, \ldots)$ be any function with "sufficiently many"
continuous partial derivatives in $D + C$. For each such function and every
point $P \in D_\Delta$, let

(3a)    $\tau\{\phi(P)\} \equiv L(\phi(P)) - L_\Delta(\phi(P));$

and for each point $P \in C_\Delta$ let

(3b)    $\beta\{\phi(P)\} \equiv B(\phi(P)) - B_\Delta(\phi(P)).$

Then the difference problem (2) is consistent with problem (1) if

(3c)    $\|\tau\{\phi\}\| \to 0, \quad \|\beta\{\phi\}\| \to 0,$

when $\Delta t \to 0$, $\Delta x \to 0$, $\Delta y \to 0$, ..., in some manner, and $\|\cdot\|$
represents norms in the appropriate sets $D_\Delta$ and $C_\Delta$. We call
$\tau\{\phi\}$ and $\beta\{\phi\}$ the local truncation errors.

If (3c) is satisfied only when some particular relationship between the
$\Delta t$, $\Delta x$, $\Delta y$, ... is maintained (i.e., say provided that
$\Delta t/\Delta x \to 0$ as $\Delta t \to 0$ and $\Delta x \to 0$), then we say that the
difference formulation is conditionally consistent. For example, with the heat
equation operator:

    $L(u) \equiv \frac{\partial u}{\partial t} - \frac{\partial^2 u}{\partial x^2}$

the ordinary explicit scheme of (4.3a) can be written in terms of the
difference operator

    $L_\Delta(U) \equiv U_t - U_{x\bar x}$

which is "unconditionally" consistent with $L(u)$. However, the
Dufort-Frankel explicit scheme employs the difference operator

    $L_\Delta'(U) \equiv \tfrac{1}{2}(U_t + U_{\bar t}) - \Bigl(U_{x\bar x} - \frac{\Delta t^2}{\Delta x^2}\,U_{t\bar t}\Bigr)$

which is consistent with $L(u)$ only if $\Delta t/\Delta x \to 0$ with the net spacing.
In fact, if $\Delta t/\Delta x = c = $ a fixed constant, then the difference operator
$L_\Delta'(U)$ is consistent with the hyperbolic operator
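
    $\frac{\partial u}{\partial t} - \frac{\partial^2 u}{\partial x^2} + c^2\,\frac{\partial^2 u}{\partial t^2}.$

These latter results follow from the simple calculation

    $\tau\{\phi\} \equiv L(\phi) - L_\Delta'(\phi)$
    $= \Bigl[\frac{\partial \phi}{\partial t} - \tfrac{1}{2}(\phi_t + \phi_{\bar t})\Bigr] - \Bigl(\frac{\partial^2 \phi}{\partial x^2} - \phi_{x\bar x}\Bigr) + \frac{\Delta t^2}{\Delta x^2}\Bigl(\frac{\partial^2 \phi}{\partial t^2} - \phi_{t\bar t}\Bigr) - \frac{\Delta t^2}{\Delta x^2}\frac{\partial^2 \phi}{\partial t^2}$
    $= -\Bigl(\frac{\Delta t^2}{6}\frac{\partial^3 \phi}{\partial t^3}\Bigr) + \Bigl(\frac{\Delta x^2}{12}\frac{\partial^4 \phi}{\partial x^4}\Bigr) - \Bigl(\frac{\Delta t^4}{12\Delta x^2}\frac{\partial^4 \phi}{\partial t^4}\Bigr) - \frac{\Delta t^2}{\Delta x^2}\frac{\partial^2 \phi}{\partial t^2}.$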
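The conditional consistency of the Dufort-Frankel operator can be observed by applying it to a known smooth function. The sketch below is an illustration only: the test function $\phi(x,t) = e^{-t}\sin x$ is chosen here (it is not from the text) because it satisfies $\partial\phi/\partial t - \partial^2\phi/\partial x^2 = 0$ and $\partial^2\phi/\partial t^2 = \phi$, so that the limits are easy to predict. The function names are hypothetical.

```python
import numpy as np

def phi(x, t):
    """Smooth test function with phi_t - phi_xx = 0 and phi_tt = phi."""
    return np.exp(-t) * np.sin(x)

def dufort_frankel_operator(x, t, dx, dt):
    """Evaluate (1/2)(phi_t + phi_tbar) - (phi_xxbar - (dt/dx)^2 phi_ttbar)."""
    centered_t = (phi(x, t + dt) - phi(x, t - dt)) / (2.0 * dt)
    second_x = (phi(x + dx, t) - 2.0 * phi(x, t) + phi(x - dx, t)) / dx ** 2
    second_t = (phi(x, t + dt) - 2.0 * phi(x, t) + phi(x, t - dt)) / dt ** 2
    return centered_t - (second_x - (dt / dx) ** 2 * second_t)

x0, t0, c = 1.0, 0.5, 2.0

# dt/dx = c fixed: the operator tends to c^2 * phi_tt = c^2 * phi, not to 0.
fixed_ratio = dufort_frankel_operator(x0, t0, 1.0e-3, c * 1.0e-3)
# dt/dx -> 0 (here dt = dx^2): the operator tends to L(phi) = 0.
vanishing_ratio = dufort_frankel_operator(x0, t0, 1.0e-2, 1.0e-4)
print(fixed_ratio, c ** 2 * phi(x0, t0), vanishing_ratio)
```

With the ratio held fixed the computed value approaches $c^2\phi$, which is the extra $c^2\,\partial^2\phi/\partial t^2$ term of the hyperbolic operator.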

In the above consideration, we have neglected to mention that initial
data must also be prescribed at $t = \Delta t$ for $L_\Delta'$ to become an explicit
scheme. In many cases, say the mixed problem (4.9) where (1b) represents
(4.9b and c) and (2b) represents (4.11a and b), we have
$\beta\{\phi\} \equiv 0$ and so only the difference approximation to the differential
equation determines consistency. On the other hand, if (1b) represents
initial conditions like (3.2) for the wave equation, then (2b) represents some
approximation like (3.13) or alternatively like (3.13a) and (3.14b). In the
first case, we obtain $\beta\{\phi\} = \mathcal{O}(\Delta t)$ and in the second case

    $\beta\{\phi\} = \mathcal{O}\Bigl[\Delta t^2 + \Delta t\,\Delta x^2 + \Delta t\Bigl(\frac{\partial^2 \phi}{\partial t^2} - c^2\,\frac{\partial^2 \phi}{\partial x^2}\Bigr)\Bigr].$

Here we have an example in which the order of the local truncation error
is increased for special functions (i.e., solutions of the wave equation).

In practice, we will not work with the exact solution of (2a and b),
because of rounding operations. Hence we will consider the solutions $W$
defined on $D_\Delta + C_\Delta$ which satisfy the modified equations

(2c)    $L_\Delta(W) = f(P) + \rho(P), \quad P \in D_\Delta,$

(2d)    $B_\Delta(W) = g(P) + \sigma(P), \quad P \in C_\Delta.$

The functions $\rho(P)$ and $\sigma(P)$ represent the error introduced in solving
(2a and b) approximately. We will refer to $\rho(P)$ and $\sigma(P)$ as rounding
errors.

We turn now to the definition of

convergence. Let $u$ be the solution of problem (1), and let $U$ be the
difference solution of problem (2). The difference solution is convergent to
the exact solution iff

    $\|U(P) - u(P)\| \to 0$

for all $P \in D_\Delta + C_\Delta$ when $\Delta t \to 0$, $\Delta x \to 0$, $\Delta y \to 0$, ..., in
some manner, and $\|\cdot\|$ represents a norm in $D_\Delta + C_\Delta$.
If the difference solution is convergent for all data in some wide class
of smooth functions $\{f, g\}$, we call the corresponding difference scheme
convergent. Notice, however, that this notion is quite distinct from that of
consistency and, in fact, a scheme may readily be consistent but not
convergent. The schemes (3.17a and c) which are consistent approximations
to (3.8a) furnish two such examples.

Care must be taken in observing that a difference scheme may be
convergent for a class, $\mathcal{F}$, of smooth functions $\{f, g\}$, but not convergent
for a larger class $\mathcal{G} \supset \mathcal{F}$. For example, consider the Cauchy problem
given by

(4a)    $L(u) \equiv \frac{\partial u}{\partial t} - \frac{\partial u}{\partial x} = 0, \quad t > 0,$
        $B(u) \equiv u(x,0) = e^{i\alpha x};$

and the corresponding difference problem

(4b)    $L_\Delta(U) \equiv U_t - U_{\hat x} = 0,$
        $B_\Delta(U) \equiv U(x,0) = e^{i\alpha x}.$

We easily verify that for any real $\alpha$, the solutions $u$ and $U$ of (4a) and
(4b), respectively, are

(5a)    $u(x,t) = e^{i\alpha(t + x)},$

and

(5b)    $U(x,t) = \Bigl(1 + i\,\frac{\Delta t}{\Delta x}\sin \alpha\Delta x\Bigr)^{t/\Delta t} e^{i\alpha x}.$

But if $|\alpha| \le M$, we have, for $0 \le t \le T$ and all $\alpha$, the uniform
convergence

(6)    $\lim_{\Delta x, \Delta t \to 0} U(x,t) = e^{i(\alpha t + \alpha x)}.$

In other words, the scheme $L_\Delta(U) = 0$ which we have shown to be
divergent, in general, in Problem 3.1, is convergent when the initial data are
chosen from the class of finite trigonometric sums, i.e., for any initial data
of the form

    $U(x,0) = \sum_{j=1}^{N} \beta_j e^{i\alpha_j x}.$

The reader should observe that even if the domain of dependence
condition is violated by the difference scheme (4b) (i.e.,
$\lambda = \Delta t/\Delta x > 1$), the solution (5b) will still converge to that in (5a).
Thus for this special class of trigonometric data the difference scheme (4b)
is unconditionally convergent.
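The convergence of (5b) to (5a) for a single fixed frequency, even with $\Delta t/\Delta x > 1$, can be checked directly from the closed forms. The sketch below is illustrative; the name `scheme_error`, the sample point $x = 0.3$, and the parameter values are assumptions made for the example.

```python
import numpy as np

def scheme_error(alpha, lam, dx, t_end=1.0, x=0.3):
    """|U - u| at (x, t) using the closed forms (5b) and (5a),
    with lam = dt/dx held fixed and t a whole number of time steps."""
    dt = lam * dx
    steps = int(round(t_end / dt))
    t = steps * dt
    U = (1.0 + 1j * lam * np.sin(alpha * dx)) ** steps * np.exp(1j * alpha * x)
    u = np.exp(1j * alpha * (t + x))
    return abs(U - u)

# lam = 2 violates the domain of dependence condition, yet the error shrinks.
e_coarse = scheme_error(alpha=1.0, lam=2.0, dx=1.0e-2)
e_fine = scheme_error(alpha=1.0, lam=2.0, dx=1.0e-3)
print(e_coarse, e_fine)
```

The error decreases roughly in proportion to $\Delta x$ here, because for fixed $\alpha$ the modulus of the factor in (5b) exceeds one only by a term of order $\lambda^2\alpha^2\Delta x^2$ per step.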
We recall that the actual evaluation of a numerical scheme produces
approximate solutions, say $W$, which satisfy slight modifications of
(2a and b), say (2c and d). Hence, in practice, we are interested primarily
in schemes for which $\|W - u\| \to 0$, when $\|\rho\| \to 0$ and
$\|\sigma\| \to 0$, as $\Delta t \to 0$, $\Delta x \to 0$, .... (The norms $\|\cdot\|$ are
defined respectively for the sets $D_\Delta + C_\Delta$, $D_\Delta$ and $C_\Delta$, and the
data $\{f, g\}$ are to belong to some wide class of functions.) We say that such
schemes have convergent approximate solutions.

When a convergent scheme is known, we are assured that difference
solutions of arbitrarily good accuracy exist. When a scheme with
convergent approximate solutions is known we are assured that difference
solutions of good accuracy can be computed. But in either case, it is
important that an a priori estimate of the error can be evaluated, preferably
in terms of the data and the mesh spacing. For this and other purposes we
introduce the concept of

stability. A difference scheme determined by linear difference operators 
L A ( •) and B A ( •) is stable if there exists a finite positive quantity K , independent 
of the net spacing , such that 

(7) it/n < *(iiz*(t/)ii + \mu)\\) 

for all net functions U defined on D A 4- C A . (The norms || • || are , as usual , 
defined for net functions on D A + C A , D A and C A respectively.) If (1) is 
valid for all net spacings, then the linear difference scheme {L A , B A } is 
unconditionally stable; if (7) holds for some restricted family of net spacings 
in which A/, Ax, A y, . . ., may all be made arbitrarily small , then (L A , B A } 
is conditionally stable. 

Clearly by this definition, stability of a difference scheme is a property 
independent of any differential equation problem. We have restricted this 
definition to linear difference schemes as they are the only ones treated in 
this chapter. However, a more general definition can be given which 
reduces to the above for linear problems. (This is an obvious restatement 
of the definition given in Section 5 of Chapter 8 for ordinary differential 
equations.) Briefly, if $L_\Delta$ and $B_\Delta$ are the difference operators in question, 
they are stable if for every pair of net functions $U$ and $V$ defined on $D_\Delta + C_\Delta$, 
there is a $K > 0$ independent of the net spacing such that 

$\|U - V\| \le K(\|L_\Delta(U) - L_\Delta(V)\| + \|B_\Delta(U) - B_\Delta(V)\|)$. 

If $L_\Delta$ and $B_\Delta$ are linear, then this reduces to the previous definition applied 
to $(U - V)$. 

The factor $K$ in the definitions of stability may depend upon the dimensions 
of the domain $D$ containing $D_\Delta$. We have already proved the stability 
of various difference schemes in the previous sections. For example, with 
the Laplace difference approximation (1.4), consider the difference 
equations (1.5). The corresponding difference operators are $L_\Delta \equiv -\Delta_\delta$ 
and $B_\Delta U \equiv U$. By applying Theorem 1.2 we deduce that, for arbitrary 
$\Delta x$ and $\Delta y$, 

$\|U\| \equiv \max_{D_\delta + C_\delta} |U| \le K\left(\max_{C_\delta} |U| + \max_{D_\delta} |\Delta_\delta U|\right) = K(\|B_\Delta U\| + \|L_\Delta U\|)$, 

where $K = \max(1, a^2/2)$. Thus the difference scheme used in (1.5) is 
unconditionally stable. 

Next, consider the (hyperbolic) system of difference equations defined 
in (3.27). We define $L_\Delta$ by 

$L_\Delta(U(x,t)) \equiv U_t(x,t) - cA\,U_x(x,t) - \cdots$, 

and the initial data are to be given by specifying 
$B_\Delta(U(x,0)) \equiv U(x,0)$. 

(The generalization to vectors $U$, $f$, and $g$ is taken for granted.) By using 
these definitions in (2) we have the difference problem 

$L_\Delta(U(x,t)) = f(x,t)$, $(x,t)$ in $D_\Delta$; $\quad U(x,0) = g(x)$. 

However, this is just the problem posed in (3.31) for $e(x,t)$ where $\tau$ 
replaces $f$ and $e(x,0)$ replaces $g(x)$. Thus, as in the derivation of (3.35) 
and (3.36), we deduce for the above difference problem that if $\lambda \equiv 
c\Delta t/\Delta x \le 1$, then with the maximum norms over the appropriate sets 

$\|U(x,t)\| \le K(\|g\| + \|f\|) = K(\|B_\Delta U\| + \|L_\Delta U\|)$, 

where K = max (Vl, It). Hence, conditional stability is established 
(i.e., for $c\Delta t \le \Delta x$) and we note that the constant $K$ grows with the time 
interval included in $D$. 

Finally, consider the explicit difference equations for the heat equation, 
which we write as 

$L_\Delta U(x,t) \equiv U_t(x,t) - U_{x\bar{x}}(x,t) = f(x,t)$. 

If, initially, we take 

$B_\Delta U(x,0) \equiv U(x,0) = g(x)$, 

then exactly as in the derivation of (4.6) from (4.3) and (4.4) we get, 
provided $\lambda \equiv \Delta t/\Delta x^2 \le \tfrac{1}{2}$, 

$\|U(x,t)\| \equiv \max_x |U(x,t)| \le K\left(\|g(x)\| + \max_{t' < t} \|f(x,t')\|\right)$, 

where $K = \max(1, t)$. Again we have conditional stability, for $\Delta t \le 
\Delta x^2/2$, and the constant $K$ grows with the time. The special choice 
$f(x,t) \equiv 0$ and $g(x) \equiv V(x,0)$ given by (4.8b), for the above difference 
problem shows that this explicit difference scheme is unstable for fixed 
$\lambda > \tfrac{1}{2}$. The purely implicit difference equations given in (4.10) with 
initial and boundary data specified as in (4.11) are easily shown, by the 
methods used in (4.15) through (4.17), to form an unconditionally stable 
difference scheme. 
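
The contrast between $\lambda \le \tfrac{1}{2}$ and $\lambda > \tfrac{1}{2}$ for the explicit heat scheme can be observed numerically. The following is a minimal sketch and not part of the original text: it assumes homogeneous boundary values, zero inhomogeneous term, and alternating initial data $\pm 1$ chosen only for illustration (the data of (4.8b) are not reproduced here). The function name and its parameters are hypothetical.

```python
import numpy as np

def explicit_heat_max_norm(lam, nx=40, steps=200):
    """Advance U(x, t+dt) = U + lam*(U(x+dx) - 2U + U(x-dx)) with zero
    boundary values and return max|U| after `steps` time steps."""
    u = np.zeros(nx + 1)
    # Alternating data +1, -1, ... at interior points (assumed for illustration);
    # it contains the highest-frequency mode that the net can represent.
    u[1:-1] = (-1.0) ** np.arange(1, nx)
    for _ in range(steps):
        # The right-hand side is evaluated in full before the assignment.
        u[1:-1] = u[1:-1] + lam * (u[2:] - 2.0 * u[1:-1] + u[:-2])
    return np.max(np.abs(u))

if __name__ == "__main__":
    print("lam = 0.4:", explicit_heat_max_norm(0.4))  # stays bounded by the initial maximum
    print("lam = 0.6:", explicit_heat_max_norm(0.6))  # grows by many orders of magnitude
```

For $\lambda \le \tfrac{1}{2}$ each new value is a convex combination of old values, so the maximum norm cannot increase; for $\lambda = 0.6$ the highest mode is multiplied by a factor of magnitude close to $4\lambda - 1 = 1.4$ at every step.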

The basic result connecting the three concepts which we have introduced 
in this section may be stated as 

THEOREM 1. Let $L_\Delta$ and $B_\Delta$ be linear difference operators which are stable 
and consistent with $L$ and $B$ on some family of nets in which $\Delta t, \Delta x, \Delta y, \ldots$ 
may be made arbitrarily small. Then the difference solution $U$ of (2) is 
convergent to the solution $u$ of (1). 

Proof. For each point $P \in D_\Delta$ and for any of the above family of nets, 
we obtain by subtracting (1a) from (2a) 

$0 = L_\Delta(U(P)) - L(u(P))$ 
$\;\; = [L_\Delta(U(P)) - L_\Delta(u(P))] + [L_\Delta(u(P)) - L(u(P))]$. 

From the assumed linearity of $L_\Delta$ and the definition of the local truncation 
error, $\tau\{\phi\}$, we then have 

(8a)    $L_\Delta(U - u) = \tau\{u(P)\}$, $\quad P \in D_\Delta$. 

In an analogous manner (1b) and (2b) imply 

(8b)    $B_\Delta(U - u) = \beta\{u(P)\}$, $\quad P \in C_\Delta$. 

However, the difference operations in (8) have been assumed stable on the 
family of nets employed here. Thus it follows that for the net function 
$(U - u)$, 

(9)    $\|U - u\| \le K(\|\tau\{u\}\| + \|\beta\{u\}\|)$. 

Now by the assumed consistency we may let $\Delta t \to 0$, $\Delta x \to 0$, $\Delta y \to 0$, ..., 
in such a manner that $\|\tau\| \to 0$ and $\|\beta\| \to 0$. Then, obviously, 
$\|U - u\| \to 0$ and convergence is demonstrated. ■ 

It should be recalled, in the above proof, that the solution $u$ of (1) is 
to have as many continuous derivatives as are required for the 
derivation of consistency. We then see from (9) that the error in the difference 
solution is estimated in terms of the local truncation errors. With 
little change in the proof, Theorem 1 is applicable if $L_\Delta$ and $B_\Delta$ are nonlinear 
stable difference operators. (It should also be observed that the 
linearity of $L$ and $B$ has not been assumed in the above proof. Their 
linearity follows from the required consistency with linear difference 
operators.) 
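
Theorem 1 can be illustrated on a problem whose exact solution is known. The sketch below is an illustration under stated assumptions, not the book's own example: it applies the explicit heat scheme with fixed $\lambda = 0.4$ to $u_t = u_{xx}$ on $[0, \pi]$ with $u(x,0) = \sin x$ and zero boundary values, whose solution is $e^{-t}\sin x$, and reports the maximum-norm error. The function name and defaults are hypothetical.

```python
import numpy as np

def heat_error(nx, lam=0.4, t_end=0.5):
    """Maximum-norm error of the explicit scheme for u_t = u_xx on [0, pi]
    with u(x,0) = sin x and u = 0 at both ends, measured near t = t_end."""
    dx = np.pi / nx
    dt = lam * dx * dx                 # net spacings tied together by lam
    n_steps = int(round(t_end / dt))
    x = np.linspace(0.0, np.pi, nx + 1)
    u = np.sin(x)
    u[0] = u[-1] = 0.0                 # enforce the boundary values exactly
    for _ in range(n_steps):
        u[1:-1] = u[1:-1] + lam * (u[2:] - 2.0 * u[1:-1] + u[:-2])
    t = n_steps * dt                   # time actually reached on this net
    return np.max(np.abs(u - np.exp(-t) * np.sin(x)))

if __name__ == "__main__":
    for nx in (10, 20, 40, 80):
        print(nx, heat_error(nx))
```

With $\lambda$ fixed, halving $\Delta x$ should reduce the error by roughly a factor of four, which is the behavior expected from a stable scheme whose truncation error is of second order in $\Delta x$.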

5.1. Further Consequences of Stability 

The stability of a linear difference scheme comes very close to insuring 
that computations with the proposed scheme are practical. More precisely, 
what we are assured is that stable linear difference equations have a 
unique solution and that, at least in principle, the growth of roundoff errors 
is bounded. In this case, the scheme has convergent approximate solutions. 

Let the difference problem (2a and b) be linear. Then the difference 
equations form a linear system whose order is equal to the number of net 
points in $D_\Delta + C_\Delta$. Now we assume that the number of unknowns and 
equations are equal. Having made this important assumption (which in 
particular cases is easily verified) we may, in order to show that (2a and b) 
has a solution, either show that some coefficient matrix is non-singular 
or, equivalently, show that the corresponding homogeneous problem 
has only the trivial solution. However, from the assumed stability of $L_\Delta$ 
and $B_\Delta$ we get from (2a and b) 

$\|U\| \le K(\|f\| + \|g\|)$. 

It follows that the system has only the trivial solution if $f = g = 0$. 
Thus the unique solvability of the linear difference problem is a simple 
consequence of stability. 

The consideration of the effect of roundoff errors is also quite simple. 
If by $W(P)$ we represent the numbers actually obtained in numerically 
approximating the solution of (2a and b), then 

(10a)    $L_\Delta W = f(P) + \rho(P)$, $\quad P \in D_\Delta$; 

(10b)    $B_\Delta W = g(P) + \sigma(P)$, $\quad P \in C_\Delta$. 

Here $\rho(P)$ and $\sigma(P)$ represent the effects of rounding, which cause $W$ to 
be in error, and hence not quite satisfy the system (2a and b). As in the 
proof of Theorem 1, we now derive by means of the linearity of $L_\Delta$ and $B_\Delta$ 

$L_\Delta(W - u) = \tau\{u(P)\} + \rho(P)$, $\quad P \in D_\Delta$; 

$B_\Delta(W - u) = \beta\{u(P)\} + \sigma(P)$, $\quad P \in C_\Delta$. 

From the assumed stability of the difference problem it now follows that 

(11)    $\|W - u\| \le K(\|\tau\| + \|\beta\| + \|\rho\| + \|\sigma\|)$. 

Thus we have shown that a stable and consistent linear scheme has convergent 
approximate solutions. 

To maintain accuracy consistent with the local truncation errors, the 
local roundoffs $\rho$ and $\sigma$ should be of the same order in the net spacing. 
The quantities $\rho$ and $\sigma$ introduced in (10) are usually the actual rounding 
errors committed in computing, divided by some multiple of the net 
spacing. This is because one does not compute with the equations in the 
form (2), but rather with multiples of these equations in which the coefficients 
are bounded as the net spacing vanishes. So the actual rounding 
errors should be reduced like the truncation error times a power of the 
net spacing. 

The definition of stability that we have given is considerably more 
restrictive than is required to prove convergence in many cases. In fact, 
the “stability constant” $K$ may be allowed to depend upon the net spacing 
and be unbounded as, say, $\Delta t \to 0$. But if in this case 

(12)    $\lim_{\Delta t \to 0} (\Delta t)^p K = 0$ 

for some $p > 0$, then convergence still follows if $\|\tau\| + \|\beta\| = O(\Delta t^p)$. 
Convergent approximate solutions are also obtained if the norms of the 
rounding errors, $\|\rho\|$ and $\|\sigma\|$, are required to be at least $O(\Delta t^p)$. [When 
a condition of the form (12) holds for all $p > p_0 > 0$ but not for $p < p_0$, 
the scheme is frequently said to be weakly stable.] In addition, as we have 
seen in the examples, many proofs of stability yield inequalities of the 
form 

(13)    $\|U\| \le K_0\|L_\Delta(U)\| + K_1\|B_\Delta(U)\|$ 

which then yield stability with the constant $K = \max(K_0, K_1)$. However, 
if for example $K_1 \to 0$ as the net spacing vanishes, then (13) would imply 
convergence if only $L_\Delta$ were consistent with $L$ but not necessarily $B_\Delta$ with $B$. 
Thus it is possible to have convergence without consistency provided a 
stronger form of stability holds. In the case of such stronger stability 
it is clear that the error bound (11) can be replaced by 

(14)    $\|W - u\| \le K_0(\|\tau\| + \|\rho\|) + K_1(\|\beta\| + \|\sigma\|)$. 

Thus a poorer approximation of the boundary conditions need not 
affect the overall accuracy if, as the net spacing vanishes, 

$K_1(\|\beta\| + \|\sigma\|) \sim K_0(\|\tau\| + \|\rho\|)$. 

5.2. The von Neumann Stability Test 

There is an important special class of difference schemes for which a 
simple algebraic criterion always furnishes a necessary and sometimes 
even a sufficient condition for stability. Simply stated, the difference schemes 
in question should have constant coefficients and correspond to pure 
initial value problems with periodic initial data. 

We shall illustrate this theory first by considering rather general explicit 
difference schemes of the form 

(15)    $U(t + \Delta t, x) = \sum_{j=-m}^{m} C_j U(t, x + j\Delta x)$; 
        $0 \le t \le T - \Delta t$, $\quad 0 \le x \le 2\pi$. 

Here $U(t,x)$ is, say, a $p$-dimensional vector to be determined on some net 
with spacing $\Delta x$, $\Delta t$ and the $C_j \equiv C_j(\Delta x, \Delta t)$ are matrices of order $p$ 
which are independent of $x$, $t$, but, in general, depend upon $\Delta x$ and $\Delta t$. 
Initial data are also specified by, say, 

(16)    $U(0, x) = g(x)$, 

where $g(x + 2\pi) = g(x)$. The difference equations (15) employ data at 
$2m + 1$ net points on level $t$ in order to compute the new vector at one 
point on level $t + \Delta t$. We use the assumed periodicity of $U(t,x)$ in order 
to evaluate (15) for $x$ near $0$ or $2\pi$. With no loss in generality then, assume 
that $m\Delta x < 2\pi$, and $U(t,x)$ is defined for all $x$ in $0 \le x \le 2\pi$, and 
$t = 0, \Delta t, 2\Delta t, \ldots$ (see Problem 2). 

Since $U(t,x)$ is to be periodic in $x$ and the $C_j$ are constants, we can 
formally construct Fourier series solutions of (15). That is, $U(t,x)$ is of 
the form 

(17)    $U(t,x) = \sum_{k=-\infty}^{\infty} V(t,k)\,e^{ikx}$. 

Upon recalling the orthogonality over $[0, 2\pi]$ of $e^{ikx}$ and $e^{iqx}$ for $k \ne q$, 
we find that this series satisfies (15) iff 

(18a)    $V(t + \Delta t, k) = G(k, \Delta x, \Delta t)\,V(t,k)$ 

where 

(18b)    $G(k, \Delta x, \Delta t) \equiv \sum_{j=-m}^{m} C_j(\Delta x, \Delta t)\,e^{ijk\Delta x}$, $\quad |k| = 0, 1, 2, \ldots$. 

From (16) and (17), it follows that the $V(0,k)$ are just the Fourier coefficients 
of the initial data, $g(x)$; i.e., 

$V(0,k) = \frac{1}{2\pi} \int_0^{2\pi} g(x)\,e^{-ikx}\,dx$. 

Repeated application of (18a) now yields 

(19)    $V(t,k) = G^n(k, \Delta x, \Delta t)\,V(0,k)$, $\quad n = t/\Delta t$. 

The matrices $G(k, \Delta x, \Delta t)$, of order $p$ as defined in (18b), are called the 
amplification matrices of the scheme (15), since they determine the growth 
of the Fourier coefficients of the solutions of (15). 
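
The definition (18b) translates directly into a few lines of code. The sketch below is illustrative only: the function name `amplification_matrix` and the dictionary representation of the coefficients $C_j$ are assumptions made here, and the sample coefficients are those of the explicit heat scheme written in the form (15), namely $C_{\pm 1} = \lambda$ and $C_0 = 1 - 2\lambda$.

```python
import numpy as np

def amplification_matrix(coeffs, k, dx):
    """Evaluate G(k, dx, dt) = sum_j C_j * exp(i*j*k*dx) as in (18b).

    `coeffs` maps each offset j to the p-by-p matrix C_j (scalars are
    accepted and treated as 1-by-1 matrices)."""
    terms = [np.atleast_2d(c).astype(complex) * np.exp(1j * j * k * dx)
             for j, c in coeffs.items()]
    return sum(terms)

if __name__ == "__main__":
    lam, k, dx = 0.4, 3, 0.1
    heat = {-1: lam, 0: 1.0 - 2.0 * lam, 1: lam}
    G = amplification_matrix(heat, k, dx)
    # For these coefficients the sum collapses to 1 - 2*lam*(1 - cos(k*dx)).
    print(G[0, 0], 1.0 - 2.0 * lam * (1.0 - np.cos(k * dx)))
```

For the scalar heat scheme the amplification factor is real and lies between $1 - 4\lambda$ and $1$, which is consistent with the restriction $\lambda \le \tfrac{1}{2}$ found earlier.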

Because $U(t,x)$ is defined for $0 \le x \le 2\pi$, we may introduce as a 
norm 

(20)    $\|U(t)\| \equiv \left\{ \frac{1}{2\pi} \int_0^{2\pi} |U(t,x)|^2\,dx \right\}^{1/2}$, $\quad t = 0, \Delta t, 2\Delta t, \ldots$, 

which for each $t$ is called the $\mathcal{L}_2$-norm of $U(t,x)$. (By $|U|$, we denote the 
Euclidean norm of the vector $U$.) However, by the Parseval equality (see 
Theorem 5.3 of Chapter 5), or directly by using (17) in (20), we have 

(21)    $\|U(t)\| = \left\{ \sum_{k=-\infty}^{\infty} |V(t,k)|^2 \right\}^{1/2}$. 

One of the main reasons for using the $\mathcal{L}_2$-norm is that it can be simply 
related, as above, to the sum of the squares of the Fourier coefficients. 
Then from (19) and (21), we conclude that 

(22)    $\|U(t)\| \le \max_k \|G^n(k, \Delta x, \Delta t)\| \left\{ \sum_{k=-\infty}^{\infty} |V(0,k)|^2 \right\}^{1/2}$ 
        $= \left( \max_k \|G^n(k, \Delta x, \Delta t)\| \right) \|U(0)\|$, $\quad t = n\Delta t$, $n = 1, 2, \ldots$. 

The matrix norm $\|G^n\|$ to be used in (22) is, of course, any norm compatible 
with the Euclidean vector norm $|V|$. As previously observed in Section 
1 of Chapter 1 the natural norm induced by this vector norm is the smallest 
such compatible matrix norm and so we shall employ it here. Thus with 
no loss in generality let $\|G^n\|$ be the spectral norm of $G^n$ [see the definition 
in (1.11) of Chapter 1 and the discussion preceding Lemma 1.2 thereof]. 
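
The relation between (20) and (21) has an exact discrete counterpart that is easy to check. The following sketch is an illustration under assumptions made here: the integral in (20) is replaced by a mean over equally spaced sample points, and the Fourier coefficients by the discrete transform returned by `numpy.fft.fft` divided by the number of points. The sample function and the helper name are arbitrary.

```python
import numpy as np

def l2_norm_two_ways(u):
    """Return the discrete analogue of (20) and of (21) for samples `u`
    of a periodic function on an equally spaced net."""
    n = len(u)
    v = np.fft.fft(u) / n                       # discrete Fourier coefficients
    from_samples = np.sqrt(np.mean(np.abs(u) ** 2))
    from_coeffs = np.sqrt(np.sum(np.abs(v) ** 2))
    return from_samples, from_coeffs

if __name__ == "__main__":
    x = 2.0 * np.pi * np.arange(64) / 64
    u = np.exp(np.sin(x)) + 0.3 * np.cos(5.0 * x)   # arbitrary smooth periodic data
    print(l2_norm_two_ways(u))                      # the two values agree to rounding
```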

Let us say, for the present, that the difference scheme (15) and (16) is 
stable (in the $\mathcal{L}_2$-norm) iff there exists a constant $K$, independent of the 
net spacing, such that 

(23)    $\|U(t)\| \le K\|U(0)\|$, $\quad 0 \le t \le T$, 

for all solutions $U(t,x)$ of (15) and (16) for all $g(x)$ with finite $\mathcal{L}_2$-norm. 
But then we see from (22) that stability is a consequence of the uniform 
boundedness of the powers of the amplification matrices. To be more 
precise we introduce the definition: the family of matrices 

$G^n(k, \Delta x, \Delta t)$, $\quad m\Delta x < 2\pi$, $\; 0 \le n\Delta t \le T$, $\; k = 0, \pm 1, \pm 2, \ldots$, 

is uniformly bounded if there exists a constant $K$ independent of $k$, $\Delta x$ and 
$\Delta t$ such that 

(24)    $\|G^n(k, \Delta x, \Delta t)\| \le K$ for $m\Delta x < 2\pi$, $\; 0 \le n\Delta t \le T$, $\; |k| = 0, 1, 2, \ldots$. 

Now we have the simple 

THEOREM 2. The difference equations (15) are stable in the $\mathcal{L}_2$-norm iff 
the amplification matrices (18b) satisfy the uniform boundedness condition 
(24). 

Proof. As indicated above, (24) and (22) imply (23), thus showing the 
sufficiency. On the other hand, if (24) is not satisfied for any finite $K$, then 
(23) cannot be satisfied for any finite $K$, for all $U(0,x)$. That is, given any 
$K$, a single Fourier term of (17) will be a solution of (15), if (18) is satisfied, 
and can be chosen so that (23) is violated. ■ 


Having established the importance of uniform boundedness we now 
state a simple necessary condition for stability due to von Neumann. 

THEOREM 3. If the scheme (15) is stable (in the $\mathcal{L}_2$-norm) then there 
exists a constant $M$, independent of the net spacing, such that 

(25)    $\rho(G(k, \Delta x, \Delta t)) \le 1 + M\Delta t$, for $k = 0, \pm 1, \pm 2, \ldots$. 

Proof. If (15) is stable, then by Theorem 2 the uniform boundedness 
condition (24) holds. But upon recalling Lemma 1.2 of Chapter 1, we have 

$\rho^n(G(k, \Delta x, \Delta t)) = \rho(G^n(k, \Delta x, \Delta t)) \le \|G^n(k, \Delta x, \Delta t)\| \le K$, 
$|k| = 0, 1, 2, \ldots$, $\; 0 \le n\Delta t \le T$, $\; m\Delta x < 2\pi$. 

With no loss in generality we may take $K > 1$ and thus for $T = n\Delta t$, 

$\rho(G(k, \Delta x, \Delta t)) \le K^{1/n} = K^{\Delta t/T} \le 1 + \frac{K}{T}\Delta t$, $\quad \Delta t \le T$. 

The last inequality is established in Problem 3 and with $M = K/T$ the 
result (25) follows. ■ 

We call (25) the von Neumann condition and Theorem 3 shows that it is 
necessary for stability in the $\mathcal{L}_2$-norm. In some cases it may also be a 
sufficient condition. For instance, if the amplification matrices $G(k, \Delta x, \Delta t)$ 
are Hermitian, then $\|G^n\| = \|G\|^n = \rho^n(G)$ and Theorem 2 implies that the 
von Neumann condition is sufficient for stability. In any event, one should 
always try to compute the eigenvalues of $G(k, \Delta x, \Delta t)$ to rule out possibly 
unstable schemes. If (25) is valid only for special nets, say with $\Delta x = \phi(\Delta t)$, 
then the scheme may only be conditionally stable (see Problems 4 and 5). 
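
Computing the eigenvalues of the amplification matrices, as recommended above, is straightforward numerically. The sketch below is a hedged illustration and does not claim to reproduce scheme (3.27): it assumes a scheme of the form (15) with $m = 1$, $C_0 = 0$ and $C_{\pm 1} = \tfrac{1}{2}(I \pm \lambda A)$ for a constant matrix $A$, and scans $\theta = k\Delta x$ over $[0, \pi]$. The function name and the sample matrix are hypothetical.

```python
import numpy as np

def max_spectral_radius(A, lam, n_modes=181):
    """Largest spectral radius of G = C_1 e^{i theta} + C_{-1} e^{-i theta}
    over theta = k*dx in [0, pi], with C_{+-1} = (I +- lam*A)/2 (assumed form)."""
    identity = np.eye(A.shape[0])
    c_plus = 0.5 * (identity + lam * A)
    c_minus = 0.5 * (identity - lam * A)
    worst = 0.0
    for theta in np.linspace(0.0, np.pi, n_modes):
        G = c_plus * np.exp(1j * theta) + c_minus * np.exp(-1j * theta)
        worst = max(worst, np.max(np.abs(np.linalg.eigvals(G))))
    return worst

if __name__ == "__main__":
    A = np.array([[0.0, 1.0], [1.0, 0.0]])   # sample matrix with eigenvalues +1 and -1
    print(max_spectral_radius(A, 0.8))        # does not exceed 1
    print(max_spectral_radius(A, 1.2))        # exceeds 1
```

Here $G$ depends on the net only through $\lambda$, so (25) reduces to $\rho(G) \le 1$ for fixed $\lambda$; for the sample matrix the eigenvalues of $G$ are $\cos\theta \pm i\lambda\sin\theta$, and the test is passed exactly when $\lambda \le 1$.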

Difference schemes of the form (15) can be applied to quite general 
classes of partial differential equation problems. For example, systems of 
the form (3.25a) can be treated provided the matrix $cA$ is constant and a 
uniformly spaced net is used. More generally, we can treat systems of the 
form 

(26)    $Lu(t,x) \equiv \frac{\partial u(t,x)}{\partial t} - \mathcal{A}u(t,x) = 0$ 

where $\mathcal{A}$ is some differential operator with respect to the $x$ variable and 
has constant coefficients. Thus the heat equation (4.1a) and other higher 
order systems are included. 

The previous analysis is easily extended to more general difference 
schemes. The most obvious such generalization is to implicit schemes which 
may be written as 

(27)    $\sum_{|j| \le m} B_j U(t + \Delta t, x + j\Delta x) = \sum_{|j| \le m} C_j U(t, x + j\Delta x)$, 
        $0 \le t \le T - \Delta t$. 

Here the $B_j$ are matrices of order $p$, independent of $x$ and $t$, but dependent, 
in general, on $\Delta x$ and $\Delta t$. We now need only change the definition of the 
amplification matrices from (18b) to 

(28)    $G(k, \Delta x, \Delta t) \equiv \left( \sum_{|j| \le m} B_j e^{ijk\Delta x} \right)^{-1} \left( \sum_{|j| \le m} C_j e^{ijk\Delta x} \right)$, 
        $|k| = 0, 1, 2, \ldots$ 

and of course require that the indicated inverse exists (see Problem 4). 
We can treat difference schemes with more than two time levels but then 
the amplification matrices are of higher order, say of order $pq$ for $q + 1$ 
time levels. Extensions to more independent variables are not difficult. 
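
Definition (28) can be evaluated in the same way as (18b), with a linear solve in place of the explicit inverse. The sketch below is illustrative: the function name is hypothetical, and the sample coefficients are those of a weighted implicit heat scheme with parameter $\theta$, assumed here to have $B_{\pm 1} = -\theta\lambda$, $B_0 = 1 + 2\theta\lambda$, $C_{\pm 1} = (1-\theta)\lambda$, $C_0 = 1 - 2(1-\theta)\lambda$ (the exact form of the book's scheme (4.18) is not reproduced in this section).

```python
import numpy as np

def implicit_amplification(b_coeffs, c_coeffs, k, dx):
    """G = (sum_j B_j e^{ijk dx})^{-1} (sum_j C_j e^{ijk dx}) as in (28).
    Each argument maps an offset j to a p-by-p matrix (or a scalar)."""
    def symbol(coeffs):
        return sum(np.atleast_2d(m).astype(complex) * np.exp(1j * j * k * dx)
                   for j, m in coeffs.items())
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(symbol(b_coeffs), symbol(c_coeffs))

if __name__ == "__main__":
    theta, lam, k, dx = 0.5, 5.0, 7, 0.05
    B = {-1: -theta * lam, 0: 1.0 + 2.0 * theta * lam, 1: -theta * lam}
    C = {-1: (1.0 - theta) * lam, 0: 1.0 - 2.0 * (1.0 - theta) * lam, 1: (1.0 - theta) * lam}
    G = implicit_amplification(B, C, k, dx)[0, 0]
    w = 1.0 - np.cos(k * dx)
    print(G, (1.0 - (1.0 - theta) * 2.0 * lam * w) / (1.0 + theta * 2.0 * lam * w))
```

With $\theta = \tfrac{1}{2}$ the computed factor has magnitude at most one even for the large value $\lambda = 5$ used here, in line with the unconditional stability usually associated with this choice.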

We recall that the stability definition (23) which has been used in this 
subsection is not the same as that in (7), which is employed in the basic 
Theorem 1. However, for the class of equations, say of the form (26), for 
which difference schemes of the form (15) or (27) are appropriate, we can 
show that (23) [or (24)] is equivalent to (7). Specifically, we have 

THEOREM 4. Let $B_\Delta \equiv I$, the identity operator, and let $L_\Delta$ be defined by 

(29)    $\Delta t\, L_\Delta U(t,x) \equiv U(t + \Delta t, x) - \sum_{|j| \le m} C_j U(t, x + j\Delta x)$. 

Then for difference problems of the form (15) and (16) the definition (23) 
of stability is equivalent to that in (7) with appropriate norms. 

Proof. The appropriate norms to employ in (7) are, in terms of the 
$\mathcal{L}_2$-norm (20), 

$\|U\| \equiv \sup_{0 \le t_n \le T} \|U(t_n)\|$, $\quad \|L_\Delta U\| \equiv \sup_{0 \le t_n \le T} \|L_\Delta U(t_n)\|$, 

$\|B_\Delta U\| \equiv \|U(0)\|$. 

Then trivially it follows that (7) implies (23) for all solutions $U$ of (15) 
and (16); that is, for all $U(t,x)$ satisfying 

$L_\Delta U = 0$ on $D_\Delta$, i.e., $0 < t \le T$; 

$B_\Delta U = U(0,x) = g(x)$ on $C_\Delta$, i.e., $t = 0$. 

Now let $U(t,x)$ be any function with convergent Fourier series (17), 
not necessarily a solution of (15). Insert the series on the right-hand side 
of (29), multiply the result by $(1/2\pi)e^{-ikx}$, and integrate over $[0, 2\pi]$. 
We obtain with the definition (18b), 

$V(t + \Delta t, k) = G(k, \Delta x, \Delta t)\,V(t,k) + \frac{\Delta t}{2\pi} \int_0^{2\pi} e^{-ikx} L_\Delta U(t,x)\,dx$. 

By applying this result recursively in $t$, with the notation $G \equiv G(k, \Delta x, \Delta t)$ 
and $n\Delta t = t$, it follows that 

$V(t,k) = G^n V(0,k) + \frac{\Delta t}{2\pi} \sum_{\nu=0}^{n-1} G^\nu \int_0^{2\pi} e^{-ikx} L_\Delta U(t - (\nu+1)\Delta t, x)\,dx$, 
$0 \le n\Delta t \le T$, $\quad |k| = 0, 1, 2, \ldots$. 

Upon taking the Euclidean norm of this vector equation, we have 

$|V(t,k)| \le \max_{\nu \le n} \|G^\nu(k, \Delta x, \Delta t)\| \times \left\{ |V(0,k)| + \Delta t \sum_{\nu=0}^{n-1} \left| \frac{1}{2\pi} \int_0^{2\pi} e^{-ikx} L_\Delta U(t - (\nu+1)\Delta t, x)\,dx \right| \right\}$. 

Finally, by using this result in (21) (see Problem 6), we get with the aid 
of the Schwarz inequality, the inequalities 

$2ab \le a^2 + b^2$ and $a + b \ge \sqrt{a^2 + b^2}$, 

the estimate 

(30)    $\|U(t)\| \le \sqrt{2} \sup_{\nu \le n,\; |k| < \infty} \|G^\nu(k, \Delta x, \Delta t)\| \left\{ \|U(0)\| + t \max_{t' < t} \|L_\Delta U(t')\| \right\}$, 
        for $n\Delta t \le T$, $\quad |k| = 0, 1, 2, \ldots$. 

Here we have introduced the $\mathcal{L}_2$-norm of $L_\Delta U(t,x)$, as in (20), and used the 
Parseval equality to deduce that 

$\|L_\Delta U(t)\|^2 = \sum_{k=-\infty}^{\infty} \left| \frac{1}{2\pi} \int_0^{2\pi} e^{-ikx} L_\Delta U(t,x)\,dx \right|^2$. 

Now let the scheme (15) be stable in the sense of (23). Then by Theorem 
2 the amplification matrices $G^\nu(k, \Delta x, \Delta t)$ are uniformly bounded, as in 
(24), and we find from (30) that 

$\|U\| \le K'\{\|U(0)\| + \|L_\Delta U\|\}$ 

where $K' \equiv \sqrt{2}\,K \max(1, T)$. But this holds for all (sufficiently smooth) 
functions $U(t,x)$ in $0 \le x \le 2\pi$, $0 \le t \le T$; and so the stability defined 
in (7) holds for the difference operator $L_\Delta$ of (29). ■ 
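
The estimate (30) can be checked numerically in a discrete setting. The sketch below is an illustration under assumptions made here, not part of the proof: it uses the periodic explicit heat-type coefficients $C_{\pm 1} = \lambda$, $C_0 = 1 - 2\lambda$ with $\lambda \le \tfrac{1}{2}$ (so that every $|G^\nu| \le 1$), replaces the norm (20) by a mean over the net points, and drives the net function with an arbitrary random forcing so that $L_\Delta U \ne 0$. All names are hypothetical.

```python
import numpy as np

def check_estimate_30(lam=0.4, n_x=32, n_t=50, seed=0):
    """Return (lhs, rhs) of estimate (30) for a periodic explicit scheme
    with C_{-1} = C_1 = lam and C_0 = 1 - 2*lam, using discrete L2 norms."""
    rng = np.random.default_rng(seed)
    dx = 2.0 * np.pi / n_x
    dt = lam * dx * dx

    def step(u):
        # Right-hand sum of (29) on a periodic net.
        return lam * np.roll(u, -1) + (1.0 - 2.0 * lam) * u + lam * np.roll(u, 1)

    def l2(u):
        # Discrete analogue of the norm (20).
        return np.sqrt(np.mean(u ** 2))

    u = rng.standard_normal(n_x)           # arbitrary initial net function
    u0_norm = l2(u)
    max_lu = 0.0
    for _ in range(n_t):
        forcing = rng.standard_normal(n_x)
        u_next = step(u) + dt * forcing
        lu = (u_next - step(u)) / dt       # L_Delta U from definition (29)
        max_lu = max(max_lu, l2(lu))
        u = u_next
    t = n_t * dt
    sup_g = 1.0                            # |1 - 2*lam*(1 - cos(k dx))| <= 1 for lam <= 1/2
    return l2(u), np.sqrt(2.0) * sup_g * (u0_norm + t * max_lu)

if __name__ == "__main__":
    lhs, rhs = check_estimate_30()
    print(lhs, "<=", rhs)
```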

We conclude with the remark that the condition of stability thus far 
has only been shown to be sufficient for convergence. But, by using elementary 
ideas of functional analysis, Lax has proved that for linear well-posed 
initial value problems, the consistent difference schemes (15) or (27) 
are stable iff the schemes are convergent! 


PROBLEMS, SECTION 5 

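1.* With the notation $\sum''$ and $x_k$ of Subsection 5.1 of Chapter 5, set $h = \Delta x$ 
and verify that the $2\pi$ periodic solution $W(x,t)$ of $W_t = W_{\hat{x}}$, with $W(x_k, 0) = 
f(x_k)$ for $k = 0, \pm 1, \ldots, \pm n$, satisfies 

$\frac{\Delta x}{2\pi} {\sum_{k=-n}^{n}}'' |W(x_k, t)|^2 = {\sum_{j=-n}^{n}}'' |b_j|^2 \left| 1 + i\frac{\Delta t}{\Delta x} \sin j\Delta x \right|^{2t/\Delta t}$ 

$\le e^{\mu t} {\sum_{j=-n}^{n}}'' |b_j|^2 = e^{\mu t}\, \frac{\Delta x}{2\pi} {\sum_{k=-n}^{n}}'' |W(x_k, 0)|^2$, 

where 

$b_j = \frac{\Delta x}{2\pi} {\sum_{k=-n}^{n}}'' f(x_k)\,e^{-ijx_k}$ and $\Delta t = \mu\Delta x^2$. 

That is, if we define $\|\cdot\|$ by 

$\|W(t)\|^2 = \frac{\Delta x}{2\pi} {\sum_{k=-n}^{n}}'' |W(x_k, t)|^2$, 

we have shown that 

$\|W(t)\| \le e^{\mu t/2} \|W(0)\|$ 

and the difference scheme is stable for $\Delta t = \mu\Delta x^2$ for any constant $\mu$ (see 
Problem 3.5). 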

2. Explain how, if $U(0,x)$ is given, (15) may be used to define $U(\Delta t, x)$ 
for $0 \le x \le 2\pi$, even though (15) is a difference equation. 

3. Verify the inequality, with $K > 1$, 

$K^x \le 1 + Kx$, for $0 \le x \le 1$. 

[Hint: Study $f(x) \equiv K^x$; show that $f'(x) > 0$, $f''(x) > 0$.] 

4. (a) Show that the amplification matrices for the Crank-Nicolson like 
scheme (4.18) are the scalars 

$G(k, \Delta x, \Delta t) = \dfrac{1 - (1 - \theta)2\lambda(1 - \cos k\Delta x)}{1 + \theta 2\lambda(1 - \cos k\Delta x)}$, $\quad \lambda \equiv \dfrac{\Delta t}{\Delta x^2}$. 

(b) Since this is a case where $G$ is Hermitian, determine by means of the 
von Neumann condition the restrictions on $\lambda$ for stability for any $\theta$ in 
$0 \le \theta \le 1$. Compare your results with (4.28). 

5. (a) Find the amplification matrices of the scheme (3.27). Verify that 
the von Neumann condition is satisfied when the Courant condition, $\lambda \equiv 
c\Delta t/\Delta x \le 1$, is satisfied. 

(b) Apply the von Neumann test to the divergent scheme 


U(x, t + At) = U(x, 0 + 2 0 “ u (* - A *, 0L 


- c :)• 


6. If $a(k,n) = b(k) + c \sum_{\nu=0}^{n-1} d(k,\nu)$, for real numbers $a$, $b$, $c$, and $d$, show 
that 

$[a(k,n)]^2 \le 2\left\{ [b(k)]^2 + c^2 \left[ \sum_{\nu=0}^{n-1} d(k,\nu) \right]^2 \right\}$. 

Hence, show that 

$[a(k,n)]^2 \le 2\left\{ [b(k)]^2 + nc^2 \sum_{\nu=0}^{n-1} [d(k,\nu)]^2 \right\}$. 

[Hint: Apply Schwarz' inequality: $(\sum 1 \cdot d)^2 \le (\sum 1^2)(\sum d^2)$.] 

Acceleration, of iterative methods, 
optimal, 470 
procedure, 120, 151 
Adams method, 388 
improved, 394 

Aitken, δ²-method, 102-108, 260 
δ²-process, 109, 152, 374 
iterative interpolation, 259 
Algorithm, 21 
convergent, 1, 23 

Alternative principle, 29, 422, 436 
Altman, 83 
Analytic(ity), 279 
domain of, 279 
Anti-symmetric, 266 
A posteriori (error) estimate, 26, 46- 
49, 140 

A posteriori test, 123 
A priori (error) estimate, 26, 37-46, 
140, 169, 448, 462 

Backward error analysis, 38 
Bairstow’s method, 131 ff. 

Balanced method, 390, 397 
Bauer, 139 

Bernoulli’s method, 128 ff., 133, 158 
Bernstein polynomials, 183 ff., 192 
Bessel’s inequality, 197, 203, 238 
Best approximation, 221 ff., 477 
trigonometric, 240 ff. 

Binomial expansion, 184 


Biorthogonal, 137 
Biorthogonalization, 154 
Bisection method, 128 
Boundary conditions, 421 ff., 443, 483 
Boundary points, see Net points 
Boundary value problem, 421 ff. 
linear, 421 ff. 

Cauchy, problem, 479 ff. 

Schwarz inequality, 5, 146, 219, 220 
sequence, 88 

Centered difference, approximation, 
293 

method, 377 ff. 

Centroid, 357 
Characteristic(s), 479 ff. 
directions, 482 
equation, 129, 134 
homogeneous, 483 
form, 481 ff. 
polynomial, 2, 407 
slope, 487 
variables, 481 ff. 

Chebyshev, 224, 267 
approximations, 211 
polynomials, 81, 152, 203, 209 ff., 
214, 219, 221, 226 ff., 236 
Chopping, 18 

Chord method, 97, 98, 113, 437 
Christoffel-Darboux relation, 205, 333 
Christoffel numbers, 333, 335 
Codiagonal elements, 164 
Compatibility condition, strong, 514 
weak, 514 

Complete linear space, 194 
Component, 136 
Composite rules, 337 ff. 

Computing problem, 21 
Comrie, 280 

Condition number, 27, 37 
Conjugate transpose, 10 
Consistent, 365, 368, 411, 514, 516 
conditionally, 516 
Contracting mapping, 85 
Convergence, 517 

in the mean, 197, 239 
of difference solutions, 491 ff., 514 ff. 
properties, 365 
Convergence factor, 119 
asymptotic, 91 

Convergent, approximate solutions, 519 
conditionally, 490, 505 
method, 412 

unconditionally, 505, 508 
Corrector, 386 ff. 

Courant, 445 
condition, 489 

Friedrichs-Lewy condition, 489 
Crank-Nicolson scheme, 509, 530 
Crout reduction, 51 
Cubature formula, 353 ff. 

composite, error, 362 
Cyclic, alternating direction method, 
477 

Jacobi scheme, 163 
parameters, 80, 476 ff. 

Dahlquist, 417 
Deflation methods, 152 ff. 

Degree of precision, see Precision 
De la Vallee-Poussin, 223 
δ²-method, 102-108 
δ²-process, 109, 152, 158 
Dependence, domain of, 480, 485 ff. 
501 

interval of, 480 
numerical domain of, 487 
Detecting isolated errors, 263 
Difference equation(s), 129, 365 
homogeneous, 406 
linear, 405 ff. 

Difference methods, consistency, 410 ff., 
514 ff. 


Difference methods, convergence, 

410 ff., 514 ff. 
convergent, 518 
stability, 410 ff., 514 ff., 519 
Difference quotients, 445 
backward, 445 
centered, 445 
forward, 445 

Difference(s), average value of, 274 
centered, 272, 283 
forward, 260 ff. 
modified second, 274 
Newton’s backward formula, 283 
operators, calculus of, 281 ff. 
Diffusion equation, 443, 501 ff. 

Direct methods, 427 ff. 

Dirichlet problem, 446 
Discontinuous integrands, 346 ff. 
Discrete orthonormality, 215 
Discretization error, 368 
Distance from a set, 142 
Divergent method, 379 ff. 

Divided differences, 246 ff., 296 
Domain of dependence condition, 489 
see also Dependence 
Double precision, 52 
Dufort-Frankel, 516 
Duhamel’s principle, 409 

Eigenfunctions, 209, 434 ff., 458 ff. 
Eigenpairs, 135 
Eigen-problems, 434 ff. 
Eigenvalue-eigenvector problem, 134— 
175 

Eigenvalue(s), 134 ff., 209, 434 ff., 

458 ff. 

localizing, 135 
principal, 147 

problems, 421 ff., 434 ff., 455 ff. 
Eigenvector(s), 134 ff., 169 
left, 137 
orthonormal, 11 
right, 137 
row, 137 

Elimination, block, 59-61, 463 
group, 59-61 
Equivalent systems, 29 
Error estimates, a posteriori; see A 
posteriori 

a priori; see A priori 
Error factor, 267 
Error propagation, 91 
Euclidean algorithm, 127 
Euler-Cauchy method, 366 ff., 383, 395, 
405 

convergence, 376 
modified, 388, 394, 402, 405 
Euler-Maclaurin summation formula, 
287, 288, 340 

Everett, see Interpolation polynomial(s) 

Explicit schemes, 365, 385, 487 
Exponent, 18 
External points, 444 
Extrapolation, 73 
to zero mesh width, 374, 383 

Factorial polynomial, 265 
Factorization methods, 52-61, 171 ff. 
False position, 98, 99-102 
Fibonacci sequence, 101 
Fike, 139 

Finite difference methods, 427 ff. 
eigenvalue problem, 455 ff. 
error estimate, 429 
Finite jump discontinuities, 346 
First order method, 91 
Floating-point arithmetic, 17 
Forsythe, 163 

Forward differences, 260 ff. 

Fourier, 239 

coefficients, 239, 484, 524 
series, 237 ff. 

Francis, 173 
Franklin, 143 
Friedrichs, 445 
Functional, analysis, 1 
iteration, 386 

Fundamental set of solutions, 406 

Gauss-Jordan elimination, 50 
Gauss-Seidel, see Iteration; Line 
Gerschgorin, 135 
Givens, 140, 160, 168 
transformation, 164 ff. 

Goldstine, 49 
Gram polynomials, 214 
Gram-Schmidt orthonormalization 
method, 199, 212, 218 
Grid, 364, 444 

Haar property, 240 
Hadamard, 21, 444 
Heat conduction equation, 443, 501 ff. 
Henrici, 163 

Hermite, interpolation, 192 
polynomials, 221, 351 
Hermitian, 70 
transpose, 10 

Hessenberg form, 160, 170 
Heun’s method, 402, 405 
Higher order equations, 418 ff. 

Hilbert segments, 196, 217 
Homogeneous problem, 422 
Householder, 160 
method, 165 ff. 

Hyman, 170 
Hyperbolic type, 482 

Ideal method, 366 

Implicit schemes, 365, 385, 392, 505 ff. 
Infinite integrand, 346 ff. 

Infinite integration limits, 350 
Initial conditions, 443 ff., 483 
Initial value methods, 424 ff. 
problem, 364 ff., 422, 443, 479 ff., 
501 ff. 

Inner product, 19, 134, 199, 212 
Instability, 504 ff. 

Interior points, see Net points 
Interpolation formulae, centered, 

270 ff. 

Interpolation polynomial(s), 187 ff., 
264 ff. 

divergence of, 275 
error, 189 ff., 248, 265 ff., 275, 296 
Everett’s form, 273, 280, 319 
Gaussian (forward) form, 271 
Hermite, 208 
Newton, 246 ff. 
osculatory, 192 ff., 255 
remainder, 265 
Interpolator, 302 
Inverse iteration, 152 ff. 

Iteration(s), alternating direction, 

475 ff. 

block, 72, 471 ff. 
extrapolated, 79 
functional, 85 ff. 

Gauss-Seidel, 66, 71, 465 ff., 470 
Gauss-Seidel, accelerated, 470 
higher order, 94 
interpolated, 79 
inverse, 157 
Iteration(s), Jacobi, 64, 72, 464, 467 
line, 471 ff. 
simultaneous, 64, 464 
successive, 66 
Iterative interpolation, 102 
inverse, 259 
linear, 258 fF. 

Iterative methods, 61-84, 85-133, 

463 ff. 

Jacobian, 113, 420 
Jacobi, method, 160 ff. 

polynomials, 335 
Jordan, 50 
form, 2 

Kantorovich, 119 
Kernel, 198, 205 
Kronecker delta, 27 
Kublanovskaja, 173 
Kutta’s method, 402, 405 

Lagrange, interpolation coefficients, 

189 ff., 218, 264 ff. 

interpolation polynomial, 189 ff., 
218, 246 
multipliers, 322 

Laguerre polynomials, 221, 351 
Laplace equation, 443, 445 ff. 

Laplacian, 445 
Lattice, 364, 444 
Lax, 529 

Least squares approximation, 194 ff., 
237 ff. 

Lebesgue integral, 194 

Legendre polynomials, 202, 206, 218 ff. 

Leibnitz’ rule, 106 

Lemniscates, 279 

Lewy, 445 

Liebmann method, 465 
Line, accelerated successive, 473 
Gauss-Seidel, 473 
Gauss-Seidel, accelerated, 474 
iterations, 471 ff. 

Jacobi method, 471 ff. 

Linear independence, 199 
Linearly independent, 406 
Lipschitz, condition, 86, 183 
constant, 85, 405 
continuity, 22 
continuous, 364, 413 
Lower Hessenberg form, 167 


Majorizing set, 378 
Mantissa, 18 

Matrix, amplification, 525 
approximate inverse, 27 
augmented, 29 
block-tridiagonal, 58 
circulant, 175 

complex conjugate transpose, 137 
convergent, 14, 63 
diagonalizable, 496 
formulation, 452 ff. 

Hermitian, 12, 137, 140 
Hessenberg, 49, 160, 170 
Hilbert segment, 196, 217 
identity, 27 
ill-conditioned, 37 
inverse, 27 
Jacobi, 55 
lower triangular, 31 
permutation, 33 
perturbed, 137 
residual, 47 
singular, 169 
sparse, 156 
symmetric, 151, 436 
transformations, 159 ff. 
triangular, 31 
tridiagonal, 55, 164 ff. 
uniformly bounded, 526 
unitary, 139, 144 
upper triangular, 31 
well-conditioned, 37 
zero, 14 

Maximum absolute column sum, 10 
Maximum absolute row sum, 9 
Maximum norm, 4, 9, 178 
Maximum principle, 431, 439, 447 ff., 
461, 513 

Mean convergence, 197, 239 
Mean square error, 196 
Measure zero, 194 
Mesh, 364, 444 
ratio, 490, 501 
widths, 364 

Midpoint rule, 240, 316, 323 
composite, 343, 344 
Milne’s method, 388, 394 
Minimizing sequence, 244 
Minimum principle, 513 
Minkowski’s inequality, 6 
Mixed initial-boundary value problem, 
483 ff. 

Mobius transformations, 77 
Modulus of continuity, 183, 343 
Moulton’s method, 388 
Multiple integrals, 352 ff. 

composite formulae, 361 ff. 
Multiplier(s), 33, 35 
Multipoint methods, 102 
Multistep methods, 384 ff. 

Multivariate interpolation, 294 ff. 

Nesting procedure, 124 
Net, 364, 444 
function, 365 

points, boundary, 444, 515 
interior, 444, 515 
slope, 487 
spacing, 364, 444 
spacing change, 393 
uniform, 367 

Neville’s iterated interpolation, 259 
Newton-Cotes formula, 308 ff., 354, 
391 

closed, 308 ff., 316, 337 
error, 310 ff., 316 
open, 313, 316 

Newton-Raphson method, 133 
Newton’s, identities, 324 

interpolation polynomial, 246 ff., 
283, 296 

method, 84, 97-99, 113, 115-119, 
126, 131, 427, 433, 441 
Nodes, see Quadrature 
Normal system, 195, 196, 211 
Norm(s), 1-17, 176 ff., 211 
compatible, 8 
essentially strict, 220 
Euclidean, 4, 10, 12, 525 
induced, 8 
ℒ₂, 525 

maximum, 4, 9, 178, 221, 240, 450 

natural, 8 

operator, 8 

P, 4, 139 

semi-, 17, 177 ff. 

spectral, 12 

strict, 181, 211 

uniform, 4 

Notation, 444 ff., 448 
subscript, 445, 448 


nth order method, 96 
Null space, 29, 145 
Numerical, differentiation, 288 ff. 
integration formula, 300 ff. 
quadrature, 300 ff. 

o(l), 109 

One-step methods, 395 ff., 419 
Open formula, 392 
Operational counts, 34 
Operations, 35 

Operator, centered difference, 283 
derivative, 281 
difference, 281, 439 
displacement, 281 
identity, 281 
Laplace, 445 ff. 
linear, 301 

linear interpolation, 461 ff. 
method, 281 ff. 
shift, 282 
Ops., 35 

Optimal parameter, 74, 470 
Optimal (parameter) value, 151 
Orthogonal, 137 
functions, 196 ff. 
polynomials, 203 ff. 

Orthogonality relations, 435 
Orthogonalization method, 152 ff. 
Orthonormal functions, 197 ff., 202 
Osculating polynomial, 192 ff. 

Overflow, 18 
Over-relaxation, 73 

Parseval’s equality, 197, 203, 238, 244, 
525 

Peaceman and Rachford, 475 
Perturbations, 43, 140 
Picard iteration, 85 
Pivot, element, 32 
maximal, 34 

maximal column, 34, 169 
Pivoting, maximal column, 45 
partial, 45 

Pointwise convergence, 205 ff. 

Pointwise error, 412 
Poisson equation, 446 ff. 

Polygon method, 367 ff. 

Positive definite, 17, 49, 457 
systems, 70 
Power method, 147 
Precision, degree of, 301, 311, 312, 
327 ff., 336, 359, 360, 401 
Predictor, 86 ff. 

Predictor-corrector method, 386 ff., 419 
error estimates, 388 ff. 

Properly posed, 92, 444 
pth order method, 82 

QR (factorization) method, 169, 173 
Quadratic convergence, 113 
Quadrature, coefficients, 300, 353 
continuous functions, 341 ff. 
error, 300 ff., 320 
interpolatory, 303 
formula, composite, 302, 336 ff. 
composite error, 338 
simple, 302, 336, 384 ff., 400 ff. 
Gauss-Chebyshev, 334 ff., 350 
Gauss-Hermite, 351 
Gauss-Laguerre, 352 
Gaussian, 327 ff. 
error, 329 

interpolatory, 303 ff. 
nodes, 300, 331, 353 
periodic functions, 340 ff. 
points, 300 

uniform coefficients, 319 ff. 
weighted, 331 ff., 349 
weighted interpolatory, 332 
weighted Gaussian, 333, 351 

Rachford, 475 
Randomness, 320 
Rank, 29 
column, 29 
row, 29 

Rate of convergence, 64, 91, 148, 464, 
467, 470, 472 
Rational function, 176 
Rayleigh quotient, 142 
Rayleigh-Ritz, 461 
Regula Falsi, 101, 260 
classical, 102 
Remainder term, 265 
Residual, 69 
correction, 68 

Richardson’s deferred approach to the 
limit, 374, 383 
Rolle’s theorem, 190, 289 
Romberg’s method, 344 
Root condition, 412 


Roots of unity, 456 
Rotations, two dimensional, 160 ff. 
Rounding, 18, 91 
errors, 517 
unbiased, 18 
Roundoff, 94 
average, 321 
contamination, 153 
error, 38, 264, 319 ff., 374 ff., 451, 
516 

accumulated, 320 
initial, 375 
local, 375 

mean-accumulated, 321 
mean-square error, 322, 336 
root mean-square error, 322 
statistical notions, 320 
uncorrelated, 321 
Runge-Kutta methods, 402, 405 
Runge’s example, 191, 275 
Rutishauser, 172 

Scale, 46 
Schur, 3, 157 

Schwarz’ inequality, see Cauchy- 
Schwarz 

Second order method, 94 
Semi-norm, 17, 177 ff. 

Separation of variables, 359, 455 ff., 
458, 483 ff., 493 
Separation property, 168 
Shooting methods, 424 ff. 

Signal, 479 ff., 501 
Simpson’s rule, 287, 316, 339 
Single-step methods, 395 ff., 402 
Singular, integrals, 346 ff. 

matrix, 169 
Smooth functions, 515 
Spectral radius, 10, 63, 510 
Splitting(s), 62, 74 
family of, 75 

Stability, 24, 365, 413, 514 ff., 519, 
522 ff. 
test, 523 ff. 

Stable, conditionally, 519 
in the ℒ₂-norm, 525 
unconditionally, 519 
weakly, 523 
Star, 446, 502 
Starring, 137 
Stationary values, 459 ff. 
Steffensen’s method, 103 
Stencil, 446, 502 
Stirling’s formula, 267, 279 
Sturm-Liouville problems, 434 
Sturm sequence(s), 126, 168 
Subdivision, 364 
Summation by parts, 213 
Symmetric, 266 
Symmetric functions, 247 
elementary, 324 
Synthetic division, 125, 131 

Taylor’s series, finite, 397 ff. 

Threshold(s), 163 
scheme, 163 
Total error, 375 

Trapezoidal rule, 239, 316, 318, 339 
end corrections, 343 
extrapolation of, 344 
periodic functions, 340 ff. 

Triangle inequality, 4, 177, 219 
Triangular, array, 297 
decomposition, 52 
Tridiagonal form, 164 ff. 

Trigonometric, approximation, 229 ff. 
interpolation, 230 ff. 
least squares approximation, 237 ff. 
sum, 229 

Trigonometric functions, discrete 
orthonormality, 215 

Truncation error, local, 368, 387, 395, 
411, 449, 490, 497, 507, 508, 
516 

order, 412 


Underflow, 18 

Undetermined coefficients, method of, 
292, 315, 317, 333, 356 ff., 380 
Unit ball, 7 

Unstable, 366, 504 ff., 521 
schemes, 382 

Vandermonde determinant, 129, 188, 
193, 242, 291, 317 
Varga, 478 

Variational equation, 427, 441 
Variational principles, 435, 459 ff. 
Vector, compound, 454 
orthonormal, 149 
residual, 48, 140 
zero, 4 

Vector space, complex, 2 
linear, 406 
Volume, 321 

von Neumann, 49, 523 ff. 
condition, 526 

Wave, 479 ff. 

equation, 443, 479 ff. 

Weierstrass, 7, 180 

approximation theorem, 183 ff., 230 
Weight function, 202, 331 
Weighted least squares approximation, 
202 ff. 

Well-posed, computing problem, 1, 22 
problem, 23, 27, 444 
Well-posed(ness), 139 ff. 

Wilkinson, 27, 44, 49, 140, 172, 174 




